fairseq S2T: Fast Speech-to-Text Modeling with fairseq

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino

Introduction

End-to-end sequence-to-sequence (S2S) modeling has seen rapidly growing application in speech-to-text (S2T) tasks. It achieves state-of-the-art performance on automatic speech recognition (ASR) (Park et al., 2019; Synnaeve et al., 2019) and has led to the recent resurgence of speech-to-text translation (ST) research (Duong et al., 2016; Bérard et al., 2016). ASR and ST are closely related. There are recent attempts to combine the two tasks under the same S2S model architecture via multi-task learning (Anastasopoulos and Chiang, 2018; Liu et al., 2020). They also benefit from each other via transfer learning (Bansal et al., 2019; Wang et al., 2020b) and are able to leverage additional supervision from machine translation (MT) and language modeling (LM). When supervised data is not abundant, self-supervised pre-training (Schneider et al., 2019; Wu et al., 2020) and semi-supervised training (Kahn et al., 2020; Pino et al., 2020) lower the requirements on supervision and improve model performance.

The growing connections among ASR, ST, MT and LM have called for all-in-one S2S modeling toolkits, and the use of large-scale unlabeled speech data sets scalability requirements. In this paper, we introduce fairseq S2T, a fairseq (Ott et al., 2019) extension for S2T tasks such as end-to-end ASR and ST. It follows fairseq’s careful design for scalability and extensibility. We provide end-to-end workflows from data pre-processing and model training to offline (online) inference. We implement state-of-the-art RNN-based (Chan et al., 2016; Bérard et al., 2018), Transformer-based (Vaswani et al., 2017; Mohamed et al., 2019) and Conformer-based (Gulati et al., 2020) models and open-source detailed training recipes. fairseq’s MT models and LMs can be seamlessly integrated into S2T workflows for multi-task learning or transfer learning. To facilitate model evaluation, we add a collection of scorers as well as VizSeq (Wang et al., 2019) integration for visualized error analysis. fairseq S2T documentation and examples are available at https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text.

Compared with counterpart toolkits such as ESPNet (Inaguma et al., 2020) and Lingvo (Shen et al., 2019), fairseq S2T pursues the best integration, scalability and reproducibility. A detailed comparison of fairseq S2T with its counterparts can be found in Table 1.

Features

fairseq provides a collection of MT models (Ng et al., 2019; Lewis et al., 2020) and LMs (Liu et al., 2019; Conneau et al., 2020) that demonstrate state-of-the-art performance on standard benchmarks. They are open-sourced with pre-trained models. fairseq also supports other tasks such as text summarization, story generation and self-supervised speech pre-training.

fairseq S2T adds attention-based RNN models (Chan et al., 2016; Bérard et al., 2018), Transformer models (Vaswani et al., 2017; Mohamed et al., 2019) as well as the latest Conformer models (Gulati et al., 2020) for ASR and ST. It also supports the CTC criterion (Graves et al., 2006) for ASR. For the simultaneous ST setting, it includes online models with widely used policies: monotonic attention (Raffel et al., 2017), wait-$k$ (Ma et al., 2019), monotonic infinite lookback attention (Arivazhagan et al., 2019b), and monotonic multihead attention (Ma et al., 2020b).

fairseq S2T extracts Kaldi-compliant (Povey et al., 2011) speech features (e.g. log mel-filter banks) automatically from WAV/FLAC audio files via PyKaldi (Can et al., 2018) or torchaudio (https://github.com/pytorch/audio). Speech features can also be pre-computed and stored in NumPy (Harris et al., 2020) format. Optionally, raw audio files or feature files can be packed into ZIP archives to improve I/O performance or facilitate file management. For further pre-processing, fairseq S2T provides online speech data transforms, including CMVN (cepstral mean and variance normalization), speed perturbation (Ko et al., 2017) and SpecAugment (Park et al., 2019). It also has an open interface for user-defined transforms. For text data, fairseq S2T does online tokenization with a rich collection of tokenizers, including Moses (https://github.com/moses-smt/mosesdecoder), SentencePiece (Kudo and Richardson, 2018), subword-nmt (https://github.com/rsennrich/subword-nmt), byte-level BPE (Wang et al., 2020a) and bytes (Li et al., 2019).
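As an illustration of one of these transforms, the following is a minimal sketch of utterance-level CMVN in plain Python. It is not the fairseq S2T implementation: the function name `utterance_cmvn`, the `eps` stabilizer and the toy feature matrix are assumptions made for this example, and a real pipeline would operate on NumPy arrays or tensors rather than nested lists.

```python
import math


def utterance_cmvn(features, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance,
    using statistics computed over the frames of a single utterance.

    `features` is a list of frames, each a list of floats (hypothetical layout).
    """
    n_frames = len(features)
    n_dims = len(features[0])
    # Per-dimension mean over all frames of the utterance
    means = [sum(frame[d] for frame in features) / n_frames for d in range(n_dims)]
    # Per-dimension (population) standard deviation
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in features) / n_frames)
        for d in range(n_dims)
    ]
    # eps guards against division by zero for constant dimensions (an assumption)
    return [
        [(frame[d] - means[d]) / (stds[d] + eps) for d in range(n_dims)]
        for frame in features
    ]


# Toy utterance: 4 frames x 2 feature dimensions
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = utterance_cmvn(toy)
```

Because the statistics come from the utterance itself, no global statistics file is needed, which is one reason this variant is convenient as an online transform.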

fairseq S2T gets raw audio (feature) paths and target texts from manifest files in TSV (tab-separated values) format, which is similar to Kaldi-style scp files. Online speech data transforms and other data-related settings (e.g. tokenizer type and vocabulary) are defined by a separate configuration file in YAML format.
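To make the manifest idea concrete, here is a small sketch that parses a TSV manifest with the Python standard library and drops overly long utterances. The column names (`id`, `audio`, `n_frames`, `tgt_text`), the file paths and the helper `load_manifest` are assumptions for illustration only; the 3,000-frame threshold mirrors the filtering described in the Experiments section.

```python
import csv
import io

# Hypothetical manifest contents; column names and paths are illustrative assumptions.
MANIFEST_TSV = (
    "id\taudio\tn_frames\ttgt_text\n"
    "utt1\tfeats/utt1.npy\t520\thello world\n"
    "utt2\taudio/utt2.flac\t3400\ta very long utterance\n"
    "utt3\taudio/utt3.wav\t2999\tgood morning\n"
)


def load_manifest(tsv_text, max_frames=3000):
    """Parse a tab-separated manifest and keep samples with at most `max_frames` frames."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in transcripts literally
    )
    samples = []
    for row in reader:
        if int(row["n_frames"]) > max_frames:
            continue  # skip utterances that are too long for GPU memory
        samples.append(row)
    return samples


samples = load_manifest(MANIFEST_TSV)
for sample in samples:
    print(sample["id"], sample["audio"], sample["tgt_text"])
```

Keeping paths and targets in one flat file, with transforms and tokenizer settings in a separate configuration file, lets the same audio be reused across tasks by swapping only the manifest.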

fairseq is implemented in PyTorch (Paszke et al., 2019) and it provides efficient batching, mixed precision training (Micikevicius et al., 2018), multi-GPU as well as multi-machine training for computational efficiency on large-scale experiments.

fairseq S2T provides common automatic metrics for ASR, ST and MT, including WER (word error rate), BLEU (Papineni et al., 2002) and chrF (Popović, 2015). It also integrates SimulEval (Ma et al., 2020a) for simultaneous ST/MT metrics such as AL (average lagging) (Ma et al., 2019) and DAL (differentiable average lagging) (Cherry and Foster, 2019).
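For readers unfamiliar with WER, the sketch below computes it for a single utterance as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. This is a generic textbook formulation, not the scorer shipped with fairseq S2T; the function name and whitespace tokenization are assumptions, and corpus-level WER would sum errors and reference lengths over all utterances before dividing.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```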

fairseq supports TensorBoard (https://github.com/tensorflow/tensorboard) for monitoring holistic metrics during model training. It also has VizSeq (Wang et al., 2019) integration for sequence-level error analysis, where speech and target/predicted text data are visualized with alignments in a Jupyter Notebook interface.

Experiments

We evaluate fairseq S2T models on an English ASR benchmark, LibriSpeech (Panayotov et al., 2015), as well as on multilingual ST benchmarks, MuST-C (Di Gangi et al., 2019a) and CoVoST 2 (Wang et al., 2020c). The model architectures used in benchmarking can be found in Table 3.

For speech inputs, we extract 80-channel log mel-filter bank features (25ms window size and 10ms shift) with utterance-level CMVN applied. We remove training samples with more than 3,000 frames for GPU memory efficiency. To alleviate overfitting, we pre-train ST model encoders on English ASR and adopt SpecAugment (without time warping): LD policy on LibriSpeech models and LB policy on MuST-C and CoVoST 2 models. We average the last 10 checkpoints and use a beam size of 5 for decoding. For ASR, we use a 10K unigram vocabulary (Kudo and Richardson, 2018) and report WER. For ST, we use a character vocabulary for CoVoST 2 and an 8K unigram vocabulary for MuST-C. We report case-sensitive detokenized BLEU using sacreBLEU (Post, 2018), except for Japanese and Chinese translations (no word segmentation), where we report character-level BLEU.
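The checkpoint averaging step can be sketched as an element-wise mean over the parameters of the most recent checkpoints. The code below is only an illustration under simplifying assumptions: each checkpoint is represented as a dictionary mapping a parameter name to a flat list of floats (real checkpoints hold tensors), and the names `average_checkpoints` and `history` are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameters across a list of checkpoints.

    Each checkpoint is a dict: parameter name -> list of floats (a stand-in
    for tensors). All checkpoints are assumed to share the same keys and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        # zip(*vectors) groups the i-th value of this parameter from every checkpoint
        averaged[name] = [sum(values) / n for values in zip(*vectors)]
    return averaged


# Toy training history of 12 checkpoints; only the last 10 are averaged
history = [{"w": [float(i), 2.0 * i]} for i in range(1, 13)]
final_model = average_checkpoints(history[-10:])
```

Averaging late checkpoints smooths out the noise of individual updates and is a common, inexpensive alternative to ensembling.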

Speech Recognition (ASR)

LibriSpeech is a de facto standard ASR benchmark that contains 1,000 hours of English speech from audiobooks. Table 4 shows the dev and test WER of our models on the LibriSpeech clean and noisy sets. Three architectures are evaluated: an RNN-based model (“B-Big”), Transformer-based models (“T-Sm”, “T-Md” and “T-Lg”) and a Conformer-based wav2vec 2.0 model (“CW-Lg”). The first two architectures achieve WER competitive with the state of the art, even though we use only the default model hyper-parameters and learning rate schedule without any task-specific tuning. Our implementation of the third architecture matches the state of the art.

Speech Translation (ST)

MuST-C contains up to around 500 hours of English speech from TED talks with translations in 8 European languages. Table 2 shows the test BLEU of our Transformer-based models (“T-Sm” and “Multi. T-Md”) and RNN-based models (“B-Base”) on all the MuST-C language directions. Compared with previous Transformer-based approaches (Di Gangi et al., 2019b; Inaguma et al., 2020), our bilingual models achieve results comparable to the state of the art without applying additional techniques such as speed perturbation and a pre-trained decoder from MT. Moreover, our multilingual model (trained on all 8 languages) outperforms all bilingual ones by large margins. Besides traditional offline models, we also provide simultaneous ST models: the lower section of Table 2 presents the online models with the wait-$k$ policy, which was the baseline system in the IWSLT 2020 shared task on simultaneous ST (Ansari et al., 2020). The results represent the best systems in the high ($\textrm{AL}>6$), medium ($6\geq\textrm{AL}>3$) and low ($\textrm{AL}\leq 3$) latency regimes, and they clearly show the trade-off between model performance and prediction latency.
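The wait-$k$ policy and the latency regimes above can be illustrated with a short sketch. It follows the token-level definitions of Ma et al. (2019): the model reads $k$ source units before writing the first target token and then alternates between reading and writing, and AL averages the lag up to the first step at which the full source has been read. The function names are hypothetical, and the sketch counts source units abstractly, so it does not model how latency is measured on audio input by an evaluation toolkit.

```python
def wait_k_reads(k, src_len, tgt_len):
    """g(t): number of source units read before writing target token t (1-indexed)."""
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]


def average_lagging(reads, src_len, tgt_len):
    """Average lagging (AL) over steps up to the first one that has read the full source."""
    gamma = tgt_len / src_len  # target-to-source length ratio
    total = 0.0
    for t, g in enumerate(reads, start=1):
        total += g - (t - 1) / gamma
        if g >= src_len:
            return total / t  # stop at the cut-off step
    return total / len(reads)  # source never fully read within the target length


def latency_regime(al):
    """Map an AL value to the regimes used in the text (thresholds 3 and 6)."""
    if al > 6:
        return "high"
    if al > 3:
        return "medium"
    return "low"


reads = wait_k_reads(k=3, src_len=10, tgt_len=10)
al = average_lagging(reads, src_len=10, tgt_len=10)
print(reads, al, latency_regime(al))
```

With equal source and target lengths, the AL of a wait-$k$ policy equals $k$, which is why a larger $k$ moves a system toward the high-latency regime while giving the model more source context for each prediction.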

CoVoST 2

CoVoST 2 contains a total of 2,880 hours of read speech in 22 languages from the open-source community, with 21 X-En directions and 15 En-X directions. We evaluate our models bidirectionally on 13 of these languages, including the low-resource X-En directions Zh, Tr, Ar, Sv, Lv, Sl, Ta, Ja, Id and Cy. We observe from Table 5 that our Transformer-based models (“T-Sm” and “T-Md”) outperform RNN-based ones (“B-Base” and “B-Big”) on all En-X and X-En directions. The performance gap tends to be larger when the training data is higher-resource (En-X directions, Fr-En, De-En and Es-En). Our multilingual models perform reasonably well with a universal model for over 15 X-En or En-X directions. They even improve significantly on some directions (e.g. at least 4 BLEU gain on Es-En). For low-resource directions, we also evaluate self-supervised speech features (Schneider et al., 2019; Wu et al., 2020), taken from a wav2vec model pre-trained on LibriSpeech (https://github.com/pytorch/fairseq/tree/master/examples/wav2vec), as an alternative to the traditional log mel-filter bank features (“+ SSL”). We find that self-supervised features bring consistent gains and transfer well across different languages (the self-supervised model is trained on English and features are extracted for non-English speech).

Conclusion

We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as speech recognition and speech translation. It includes end-to-end workflows and state-of-the-art models designed for scalability and extensibility. It seamlessly integrates fairseq’s machine translation models and language models to improve S2T model performance. fairseq S2T documentation and examples are available at https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text.

Acknowledgments

We thank Myle Ott, Michael Auli, Alexei Baevski, Jiatao Gu, Abdelrahman Mohamed and Javad Dousti for helpful discussions.