End-to-End Automatic Speech Translation of Audiobooks

Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu, Olivier Pietquin

Introduction

Most spoken language translation (SLT) systems integrate (loosely or closely) two main modules: source language speech recognition (ASR) and source-to-target text translation (MT). In these approaches, a symbolic sequence of words (or characters) in the source language is used as an intermediary representation during the speech translation process. However, recent works have attempted to build end-to-end speech-to-text translation systems without using source language transcription during learning or decoding. One attempt to directly translate a source speech signal into target language text is that of . However, the authors focus on the alignment between source speech utterances and their text translation without proposing a complete end-to-end translation system. The first attempt to build an end-to-end speech-to-text translation system (which does not use source language transcription) is our own work, but it was applied to a synthetic (TTS) speech corpus. A similar approach was then proposed and evaluated on a real speech corpus by .

This paper is a follow-up to our previous work. We now investigate end-to-end speech-to-text translation on a corpus of audiobooks, LibriSpeech, specifically augmented to perform end-to-end speech translation. While previous works investigated the extreme case where source language transcription is available neither during learning nor during decoding (unwritten language scenario defined in ), we also investigate, in this paper, a midway case where a certain amount of source language transcription is available during training. In this intermediate scenario, a single (end-to-end) model is trained to decode source speech into target text in one pass (which can be interesting if compact speech translation models are needed).

This paper is organized as follows: after presenting our corpus in section 2, we present our end-to-end models in section 3. Section 4 describes our evaluation on two datasets: the synthetic dataset used in and the audiobook dataset described in section 2. Finally, section 5 concludes this work.

Audiobook Corpus for End-to-End Speech Translation

Large quantities of parallel texts (e.g. Europarl or OpenSubtitles) are available for training text machine translation systems, but there are no large (>100h) and publicly available parallel corpora that include speech in a source language aligned to text in a target language. The Fisher/Callhome Spanish-English corpora are only medium size (38h), contain low-bandwidth recordings, and are not available for free.

We very recently built a large English-to-French corpus for direct speech translation training and evaluation, which is much larger than the existing corpora described above. (The Augmented LibriSpeech corpus is available for download at https://persyval-platform.univ-grenoble-alpes.fr/DS91/detaildataset.) We started from the LibriSpeech corpus used for Automatic Speech Recognition (ASR), which has 1000 hours of speech aligned with its transcriptions.

The read audiobook recordings derive from a project based on a collaborative effort: LibriVox. The speech recordings are based on public domain books available on the Gutenberg Project (https://www.gutenberg.org/), which are distributed in LibriSpeech along with the recordings.

Our augmentation of LibriSpeech is straightforward: we automatically aligned e-books in a foreign language (French) with English utterances of LibriSpeech. This led to 236 hours of English speech aligned to French translations at utterance level (more details can be found in ). Since English (source) transcriptions are initially available for LibriSpeech, we also translated them using Google Translate. To summarize, for each utterance of our 236h corpus, the following quadruplet is available: English speech signal, English transcription, French text translation 1 (from alignment of e-books) and translation 2 (from MT of English transcripts).

MT and AST tasks

This paper focuses on the speech translation (AST) task of audiobooks from English to French, using the Augmented LibriSpeech corpus. We compare a direct (end-to-end) approach with a cascaded approach that combines a neural speech transcription (ASR) model with a neural machine translation (MT) model. The ASR and MT results are also reported as baselines for future uses of this corpus.

Augmented LibriSpeech contains 236 hours of speech in total, which is split into 4 parts: a test set of 4 hours, a dev set of 2 hours, a clean train set of 100 hours, and an extended train set with the remaining 130 hours. Table 1 gives detailed information about the size of each corpus. All segments in the corpus were sorted according to their alignment confidence scores, as produced by the alignment software used by the authors of the corpus. The test, dev and train sets correspond to the highest rated alignments. The remaining data (extended train) is noisier, as it contains more incorrect alignments. The test set was manually checked, and incorrect alignments were removed. We perform all our experiments using train only (without extended train). Furthermore, we double the training size by concatenating the aligned references with the Google Translate references. We also mirror our experiments on the BTEC synthetic speech corpus, as a follow-up to .
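The doubling of the training set described above can be sketched as follows. This is only an illustration under stated assumptions, not the preprocessing code used for the experiments: the `Utterance` record, its field names and the `double_training_pairs` helper are hypothetical, chosen to mirror the quadruplet available for each utterance.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    # Hypothetical record mirroring the quadruplet available per utterance.
    audio_path: str   # English speech signal
    transcript: str   # English transcription
    aligned_ref: str  # French translation 1 (from e-book alignment)
    mt_ref: str       # French translation 2 (from Google Translate)


def double_training_pairs(utterances: List[Utterance]) -> List[Tuple[str, str]]:
    """Build (speech, target) pairs: every utterance appears twice,
    once with its aligned reference and once with its MT reference."""
    aligned = [(u.audio_path, u.aligned_ref) for u in utterances]
    machine = [(u.audio_path, u.mt_ref) for u in utterances]
    # The source side is duplicated; the target side is a concatenation.
    return aligned + machine
```

With this layout, a train set of N utterances yields 2N training pairs, while the dev and test sets are left untouched.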

End-to-End Models

For the three tasks, we use encoder-decoder models with attention. Because we want to share some parts of the model between tasks (multi-task training), the ASR and AST models use the same encoder architecture, and the AST and MT models use the same decoder architecture.

This model differs from , which did not use convolutions, but time pooling between each LSTM layer, resulting in a shorter sequence (pyramidal encoder).

Character-level decoder

We use a character-level decoder composed of a conditional LSTM, followed by a dense layer.

Experiments

Speech files were preprocessed using Yaafe to extract 40 MFCC features and frame energy for each frame, with a step size of 10 ms and a window size of 40 ms, following . We tokenize and lowercase all the text, and normalize the punctuation, with the Moses scripts (http://www.statmt.org/moses/). For BTEC, the same preprocessing as is applied. Character-level vocabularies for LibriSpeech are of size $46$ for English (transcriptions) and $167$ for French (translation). The decoder outputs are always at the character level (for AST, MT and ASR). For the MT task, the LibriSpeech English (source) side is preprocessed into subword units. We limit the number of merge operations to $30k$, which gives a vocabulary of size $27k$. The MT encoder for BTEC takes entire words as input.
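To make the framing arithmetic concrete, the sketch below computes how many feature frames a recording produces with a 40 ms window and a 10 ms step, and the resulting feature matrix shape (40 MFCCs plus frame energy). It assumes that only full windows are kept and that no padding is applied; the actual behavior of Yaafe at signal boundaries may differ, and the function names are hypothetical.

```python
from typing import Tuple

WINDOW_MS = 40               # analysis window size
STEP_MS = 10                 # step between consecutive windows
FEATURES_PER_FRAME = 40 + 1  # 40 MFCCs plus frame energy


def num_frames(duration_ms: int, window_ms: int = WINDOW_MS,
               step_ms: int = STEP_MS) -> int:
    """Number of full windows that fit in the signal (assumes no padding)."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // step_ms + 1


def feature_shape(duration_ms: int) -> Tuple[int, int]:
    """Shape (frames, features) of the feature matrix for one recording."""
    return num_frames(duration_ms), FEATURES_PER_FRAME


print(feature_shape(1000))  # a one-second utterance: (97, 41)
```

Under these assumptions, each second of speech yields roughly 100 frames of 41 features.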

Our BTEC models use an LSTM size of $m=m^{\prime}=256$, while the LibriSpeech models use a cell size of $512$, except for the speech encoder layers, which use a cell size of $m=256$ in each direction. We use character embeddings of size $k=64$ for BTEC, and $k=128$ for LibriSpeech. The MT encoders are shallower, with a single bidirectional layer. The source embedding sizes for words (BTEC) and subwords (LibriSpeech) are respectively $128$ and $256$.

The input layers in the speech encoders have a size of $256$ for the first layer and $n^{\prime}=128$ for the second. The LibriSpeech decoders use an output layer size of $l=512$. For BTEC, we do not use any non-linear output layer, as we found that this led to overfitting.

Training settings

We train our models with Adam, with a learning rate of $0.001$, and a mini-batch size of 64 for BTEC, and 32 for LibriSpeech (because of memory constraints). We use variational dropout, i.e., the same dropout mask is applied to all elements in a batch at all time steps, with a rate of $0.2$ for LibriSpeech and $0.4$ for BTEC. In the MT tasks, we also drop source and target symbols at random, with probability $0.2$. Dropout is not applied on recurrent connections.
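The idea behind variational dropout as described above can be illustrated with a small, dependency-free sketch: a single mask over the feature dimension is sampled once and reused for every sequence in the batch and every time step. This is not the toolkit's implementation. The function names are hypothetical, and rescaling kept units by 1/(1 - rate) (inverted dropout) is an assumption not stated in the text.

```python
import random
from typing import List

Sequence = List[List[float]]  # time steps x features


def sample_mask(dim: int, rate: float, rng: random.Random) -> List[float]:
    """Sample one dropout mask over the feature dimension.
    Kept units are rescaled by 1 / (1 - rate) (assumed inverted dropout)."""
    keep = 1.0 - rate
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(dim)]


def variational_dropout(batch: List[Sequence], rate: float,
                        rng: random.Random) -> List[Sequence]:
    """Apply the same mask to every sequence and every time step."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    dim = len(batch[0][0])
    mask = sample_mask(dim, rate, rng)  # sampled once per batch
    return [[[x * m for x, m in zip(step, mask)] for step in seq]
            for seq in batch]


# Usage: rate 0.2 as for LibriSpeech (0.4 for BTEC).
rng = random.Random(0)
batch = [[[1.0] * 8 for _ in range(3)] for _ in range(2)]
dropped = variational_dropout(batch, rate=0.2, rng=rng)
```

In contrast, standard dropout would sample a fresh mask at every time step, so a given unit could be dropped at one step and kept at the next.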

We train all our models on LibriSpeech train augmented with the Google Translate references, i.e., the source side of the corpus (speech) is duplicated, and the target side (translations) is a concatenation of the aligned references with the Google Translate references. Because of GPU memory limits, we set the maximum length to $1400$ frames for LibriSpeech input, and $300$ characters for its output. This covers about $90\%$ of the training corpus. Longer sequences are kept but truncated to the maximum size. We evaluate our models on the dev set every 1000 mini-batch updates using BLEU for AST and MT, and WER for ASR, and keep the best performing checkpoint for final evaluation on the test set.
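The length limits above can be sketched as a simple truncation step, together with a helper that measures which fraction of a corpus already fits within the limits. The helper names are hypothetical, and counting an example as covered only when both its input and its output fit is an assumption about how the 90% figure is computed.

```python
from typing import List, Sequence, Tuple

MAX_FRAMES = 1400  # maximum input length, in feature frames
MAX_CHARS = 300    # maximum output length, in characters

Example = Tuple[Sequence, str]  # (feature frames, target text)


def truncate(example: Example) -> Example:
    """Keep the example, but cut input and output to the maximum sizes."""
    frames, target = example
    return frames[:MAX_FRAMES], target[:MAX_CHARS]


def coverage(examples: List[Example]) -> float:
    """Fraction of examples that fit within both limits without truncation."""
    if not examples:
        return 0.0
    fits = sum(1 for frames, target in examples
               if len(frames) <= MAX_FRAMES and len(target) <= MAX_CHARS)
    return fits / len(examples)
```

Truncating rather than discarding long examples keeps every utterance in the training set, at the cost of some targets no longer matching the full audio.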

Our models are implemented with TensorFlow as part of the LIG-CRIStAL NMT toolkit. (The toolkit and configuration files are available at https://github.com/eske/seq2seq.)

Results

Table 2 presents the results for the ASR and MT tasks on BTEC and LibriSpeech. The MT task (and by extension the AST task) on LibriSpeech (translating novels) looks particularly challenging, as we observe BLEU scores around 20% (Google Translate is also scored as a topline, at 22.2%).

For Automatic Speech Translation (AST), we try four settings. The cascaded model combines both the ASR and MT models (as a pipeline). The end-to-end model (described in section 3) does not make any use of source language transcripts. The pre-trained model is identical to end-to-end, but its encoder and decoder are initialized with our ASR and MT models. The multi-task model is also pre-trained, but continues training for all tasks, by alternating updates like , with 60% of updates for AST and 20% each for ASR and MT.
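The update proportions of the multi-task setting can be illustrated with a small scheduling sketch. It assumes that the task for each update is drawn at random with the stated proportions, which is one possible reading; a deterministic interleaving would satisfy the same ratios. The names `TASK_PROBS`, `sample_task` and `schedule` are hypothetical.

```python
import random
from collections import Counter
from typing import List

# Share of updates per task: 60% AST, 20% ASR, 20% MT.
TASK_PROBS = {"AST": 0.6, "ASR": 0.2, "MT": 0.2}


def sample_task(rng: random.Random) -> str:
    """Draw the task for the next update (assumed random alternation)."""
    tasks = list(TASK_PROBS)
    weights = list(TASK_PROBS.values())
    return rng.choices(tasks, weights=weights, k=1)[0]


def schedule(num_updates: int, seed: int = 0) -> List[str]:
    """Sequence of tasks for a run of `num_updates` mini-batch updates."""
    rng = random.Random(seed)
    return [sample_task(rng) for _ in range(num_updates)]


# Usage: inspect the empirical share of each task over 10,000 updates.
counts = Counter(schedule(10_000))
shares = {task: counts[task] / 10_000 for task in TASK_PROBS}
```

In a training loop, each sampled task would select the matching mini-batch and the encoder/decoder pair to update, with the speech encoder shared by ASR and AST and the text decoder shared by AST and MT.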

Tables 3 and 4 present the results for the end-to-end AST task on BTEC and LibriSpeech. On both corpora, we show that: (1) it is possible to train compact end-to-end AST models with a performance close to cascaded models; (2) pre-training and multi-task learning (if source transcriptions are available at training time) improve AST performance; (3) contrary to , in both BTEC and LibriSpeech settings, the best AST performance is observed when a symbolic sequence of symbols in the source language is used as an intermediary representation during the speech translation process (cascaded system); (4) finally, the AST results presented on LibriSpeech demonstrate that our augmented corpus is useful, although challenging, to benchmark end-to-end AST systems on real speech at a large scale. We hope that the baseline we established on Augmented LibriSpeech will be challenged in the future.

The large improvements on MT and AST on the BTEC corpus, compared to are mostly due to our use of a better decoder, which outputs characters instead of words.

Analysis

Figure 1 shows the evolution of BLEU and WER scores for MT and ASR tasks with single models, and when we continue training them as part of a multi-task model. The multi-task procedure does more updates on AST, which explains the degraded results, but we observe that the speech encoder and text decoder are still able to generalize well to other tasks.

Figure 2 shows the evolution of dev BLEU scores for our three AST models on LibriSpeech. We see that pre-training helps the model converge much faster. Eventually, the End-to-End system reaches a similarly good solution, but after three times as many updates. Multi-Task training does not seem to be helpful when combined with pre-training.

Conclusion

We present baseline results on End-to-End Automatic Speech Translation on a new speech translation corpus of audiobooks, and on a synthetic corpus extracted from BTEC (follow-up to ). We show that, while cascading two neural models for ASR and MT gives the best results, end-to-end methods that incorporate the source language transcript come close in performance.