ESPnet: End-to-End Speech Processing Toolkit

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai

cs.CL

Introduction

Automatic speech recognition (ASR) has become a mature technology thanks to extensive research and development efforts, mainly in the speech processing community. In particular, these efforts have been driven by popular products including Google voice search, Amazon Alexa, and Apple Siri, and by open source activities including Kaldi, HTK, Sphinx, Julius, and RASR, in addition to general research activities. These open source toolkits include feature extraction, acoustic modeling based on a hidden Markov model (HMM), Gaussian mixture model, and deep neural network (DNN), and decoding (language modeling is often performed by external language model toolkits, for example SRILM), and they enable us to use a full set of state-of-the-art ASR research and development achievements.

This paper describes a new open source toolkit named ESPnet (End-to-end speech processing toolkit), which aims to provide a neural end-to-end platform for ASR and other speech processing. Unlike the above open source tools based on hybrid DNN/HMM architectures, ESPnet provides a single neural network architecture to perform speech recognition in an end-to-end manner. ESPnet adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the style of the Kaldi ASR toolkit for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

ESPnet fully utilizes the benefits of two major end-to-end ASR implementations based on connectionist temporal classification (CTC) and the attention-based encoder-decoder network. Attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while CTC uses Markov assumptions to efficiently solve sequential problems by dynamic programming. ESPnet adopts hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness on irregular alignments and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments.

In addition to the above basic architecture, ESPnet supports a number of end-to-end ASR techniques, including fusion of a recurrent neural network language model (RNNLM), fast CTC computation using the warp CTC library, and many variations of attention methods. With these state-of-the-art end-to-end ASR techniques, ESPnet also provides a number of recipes for major ASR benchmarks including Wall Street Journal (WSJ), Librispeech, TED-LIUM, Corpus of Spontaneous Japanese (CSJ), AMI, HKUST Mandarin CTS, VoxForge, CHiME-4/5, etc. Thus, ESPnet provides publicly available state-of-the-art end-to-end ASR setups, which aim to accelerate the development of this emergent field. This paper describes its basic architecture, functionalities, and benchmark results. Note that several benchmarks, including HKUST and CSJ, show performance comparable or superior to that of state-of-the-art hybrid DNN/HMM systems based on lattice-free maximum mutual information training.

Related studies

This section mainly focuses on the comparison of ESPnet with publicly available toolkits within an end-to-end ASR framework. We can categorize the toolkits into two types based on CTC and attention architectures as follows:

CTC-based: EESEN, Stanford CTC, Baidu DeepSpeech

Attention-based: Attention-LVCSR , OpenNMT speech to text

Note that most end-to-end ASR toolkits are based on CTC, while ESPnet is based on an attention-based encoder-decoder network. Compared with Attention-LVCSR and OpenNMT, ESPnet has more ASR-specific functions, including hybrid CTC/attention to deal with monotonic attention, use of RNNLM during decoding, and a number of Kaldi-style ASR recipes, which distinguish ESPnet from the other toolkits.

Functionality

Figure 1 shows the software architecture of ESPnet. In ESPnet, the main neural network training and recognition parts are written in Python, which calls either Chainer or PyTorch depending on the backend option. We also provide complete recipes to perform ASR experiments, which are written as bash scripts in the Kaldi manner. The following sections describe several functions that distinguish ESPnet from other existing toolkits.

ESPnet tightly integrates its data preprocessing part with Kaldi so that 1) we can fairly compare the performance obtained by Kaldi hybrid systems with ESPnet end-to-end systems and 2) we can make use of data preprocessing developed in the Kaldi recipes. ESPnet also uses Kaldi feature extraction for most recipes, although multichannel end-to-end ASR includes speech enhancement and feature extraction within its network.

2 Attention-based encoder-decoder

The default encoder network is a bidirectional long short-term memory (BLSTM) network with subsampling (called pyramid BLSTM), which takes a $T$-length speech feature sequence $\mathbf{o}_{1:T}$ and extracts a high-level feature sequence $\mathbf{h}_{1:T^{\prime}}$ as

$$\mathbf{h}_{1:T^{\prime}}=\text{BLSTM}(\mathbf{o}_{1:T})$$

where $T^{\prime}<T$ in general due to the subsampling. The Chainer backend also supports convolutional neural networks based on the initial two blocks of VGG layers ($\text{VGG}_{2}()$) followed by BLSTM layers, that is

$$\mathbf{h}_{1:T^{\prime}}=\text{BLSTM}(\text{VGG}_{2}(\mathbf{o}_{1:T}))$$

This yields better performance than the pyramid BLSTM in many cases.
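The effect of subsampling on the sequence length can be illustrated with a minimal sketch. It only models the frame skipping between layers and omits the recurrent layers themselves; the function name `subsample` and the per-layer factors `[1, 2, 2, 1]` are assumptions chosen for illustration, not values taken from the toolkit.

```python
def subsample(frames, factors):
    """Apply per-layer frame skipping, as in a pyramid-style encoder.

    `factors` holds one hypothetical subsampling factor per layer
    (1 keeps every frame, 2 keeps every second frame). The BLSTM
    computation of each layer is intentionally left out.
    """
    for factor in factors:
        frames = frames[::factor]
    return frames


# T = 100 input frames, four layers, two of which halve the frame rate.
frames = list(range(100))
encoded = subsample(frames, [1, 2, 2, 1])
print(len(encoded))  # T' = 25
```

With two halving layers, $T^{\prime}$ is roughly $T/4$, which shortens the sequence that the attention mechanism and the decoder have to process.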

2.2 Attention

ESPnet uses a location-aware attention mechanism as its default attention. Dot-product attention is also supported. While location-aware attention yields better performance, dot-product attention has a much lower computational cost. In addition to the above attentions, the PyTorch backend supports more than 11 types of attention functions, including additive attention, the coverage mechanism, and multi-head attention.
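As a rough illustration of the simplest of these variants, the sketch below computes dot-product attention for a single decoder step in plain Python. It is not ESPnet's implementation: the function names are hypothetical, no learned projections are applied, and location-aware attention would additionally use the previous attention weights, which is not shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, encoder_states):
    """Return attention weights and the context vector for one decoder step.

    query: decoder state (list of floats).
    encoder_states: list of encoder output vectors h_1..h_T'.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


weights, context = dot_product_attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
print(weights)  # almost all weight falls on the first encoder state
```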

3 Hybrid CTC/attention

ESPnet adopts hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding.

During training, we employ the multiobjective learning framework by combining the CTC loss $\mathcal{L}^{\text{ctc}}$ and the attention-based cross entropy $\mathcal{L}^{\text{att}}$ to improve robustness and achieve fast convergence, as follows:

$$\mathcal{L}=\alpha\mathcal{L}^{\text{ctc}}+(1-\alpha)\mathcal{L}^{\text{att}}$$

In this training method, the CTC and attention decoder networks share the same encoder. There is one tuning parameter $\alpha$ that linearly interpolates the two objective functions; it is usually set to $\alpha=0.5$ (equal contributions).
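The interpolation described above amounts to a single line of arithmetic. The following sketch assumes the two loss values have already been computed for a mini-batch; the function name `multiobjective_loss` is hypothetical.

```python
def multiobjective_loss(loss_ctc, loss_att, alpha=0.5):
    """Linearly interpolate the CTC and attention losses with weight alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_ctc + (1.0 - alpha) * loss_att


print(multiobjective_loss(2.0, 4.0))             # 3.0 (equal contributions)
print(multiobjective_loss(2.0, 4.0, alpha=1.0))  # 2.0 (CTC only)
print(multiobjective_loss(2.0, 4.0, alpha=0.0))  # 4.0 (attention only)
```

Setting `alpha` to 0 or 1 recovers a pure attention or pure CTC system, so the same training code covers all three configurations.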

To alleviate overfitting, label smoothing techniques are available during training. They smooth the target distribution by dividing the probability mass between the correct label and the remaining labels in a certain ratio. We implemented unigram smoothing, where the distribution over the remaining labels is set to be proportional to the unigram distribution of the labels.
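A minimal sketch of unigram smoothing as described in this paragraph is given below. The function name, the dictionary-based representation, and the smoothing ratio of 0.1 are assumptions made for illustration; the toolkit's own implementation may differ in detail.

```python
def unigram_smoothed_target(correct, unigram_counts, smoothing=0.1):
    """Build a smoothed target distribution over the label set.

    A share of (1 - smoothing) goes to the correct label. The remaining
    `smoothing` mass is spread over the other labels in proportion to
    their unigram counts in the training transcriptions.
    """
    others = {lab: cnt for lab, cnt in unigram_counts.items() if lab != correct}
    total = sum(others.values())
    if total == 0:
        raise ValueError("need at least one other label with a nonzero count")
    target = {lab: smoothing * cnt / total for lab, cnt in others.items()}
    target[correct] = 1.0 - smoothing
    return target


counts = {"a": 6, "b": 3, "c": 1}  # hypothetical unigram counts
target = unigram_smoothed_target("a", counts, smoothing=0.1)
print(target)
```

In this example the correct label "a" receives 0.9, and the remaining 0.1 is split 3:1 between "b" and "c".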

3.2 Warp CTC

CTC accounts for a dominant part of the whole computation time in training. We use the warp CTC library for both the Chainer and PyTorch backends, which yields a 5-10% speed improvement in total training time compared with the built-in CTC in the Chainer backend case.
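To give a sense of the computation involved, the sketch below implements the standard CTC forward recursion in plain Python. It is a reference illustration only and not how warp CTC is implemented: practical libraries work in the log domain, process batches, and run on GPUs. The function name and the convention that symbol 0 is the blank are assumptions.

```python
def ctc_probability(probs, labels, blank=0):
    """Sum the probabilities of all CTC alignments of `labels`.

    probs[t][k] is the probability of symbol k at frame t.
    labels is the target label sequence without blanks.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    size = len(ext)

    # Initialisation: a path may start with a blank or with the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or in the final blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Two frames, two symbols (blank and label 1), uniform posteriors.
print(ctc_probability([[0.5, 0.5], [0.5, 0.5]], [1]))  # 0.75
```

The example sums the three alignments of a single label over two frames (label-label, label-blank, blank-label), each with probability 0.25. The nested loop over frames and extended labels shows why this step is worth accelerating.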

3.3 Joint decoding

During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Let $y_{n}$ be a hypothesis of the output label at position $n$ given a history $y_{1:n-1}$ and encoder output $\mathbf{h}_{1:T^{\prime}}$. The following combination of the attention log probability $\log p^{\text{att}}$ and the CTC log probability $\log p^{\text{ctc}}$ is performed during the beam search:

$$\log p^{\text{hyp}}(y_{n}|y_{1:n-1},\mathbf{h}_{1:T^{\prime}})=\alpha\log p^{\text{ctc}}(y_{n}|y_{1:n-1},\mathbf{h}_{1:T^{\prime}})+(1-\alpha)\log p^{\text{att}}(y_{n}|y_{1:n-1},\mathbf{h}_{1:T^{\prime}})$$

This hybrid CTC/attention architecture (multiobjective learning during training and joint decoding during recognition) was proposed in earlier work and is a unique function compared with other end-to-end ASR systems.
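The score combination can be sketched for a single beam-search step as follows. The sketch assumes a linear interpolation of the two log probabilities with a weight called `alpha` here; the function names, the candidate labels, and their probabilities are made up for illustration, and computing the CTC prefix score itself is outside the scope of this example.

```python
import math


def joint_score(log_p_ctc, log_p_att, alpha=0.5):
    """Interpolate CTC and attention log probabilities for one candidate."""
    return alpha * log_p_ctc + (1.0 - alpha) * log_p_att


def rank_candidates(candidates, alpha=0.5, beam=2):
    """Keep the `beam` best next labels according to the joint score.

    candidates maps a label to a (log_p_ctc, log_p_att) pair.
    """
    scored = {lab: joint_score(c, a, alpha) for lab, (c, a) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:beam]


candidates = {
    "a": (math.log(0.6), math.log(0.2)),
    "b": (math.log(0.3), math.log(0.7)),
    "c": (math.log(0.1), math.log(0.1)),
}
print(rank_candidates(candidates, alpha=0.5))  # ['b', 'a']
print(rank_candidates(candidates, alpha=1.0))  # ['a', 'b'] (CTC scores only)
```

A candidate that the attention decoder likes but that CTC finds implausible is pushed down the ranking, which is how the joint score discourages irregular alignments.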

4 Use of language model

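One of the most demanded functions of attention-based end-to-end ASR is the ability to make use of a language model trained on a large amount of text. ESPnet can combine the log probability $\log p^{\text{lm}}$ of an RNNLM during decoding as follows:

$$\log p(y_{n}|y_{1:n-1},\mathbf{h}_{1:T^{\prime}})=\log p^{\text{hyp}}(y_{n}|y_{1:n-1},\mathbf{h}_{1:T^{\prime}})+\beta\log p^{\text{lm}}(y_{n}|y_{1:n-1})$$

where $p^{\text{hyp}}$ denotes the joint CTC/attention score used in the beam search and $\beta$ is an additional scaling parameter. This method corresponds to a shallow fusion of a decoder network and RNNLM, originally proposed in neural machine translation and later applied to end-to-end speech recognition.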
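The following sketch shows how such a shallow fusion can change the decoding decision for one output position. The function names, the two candidate words, their probabilities, and the value `beta=0.3` are all hypothetical and chosen only to make the effect visible.

```python
import math


def shallow_fusion(log_p_hyp, log_p_lm, beta=0.3):
    """Add the scaled language model log probability to the ASR score."""
    return log_p_hyp + beta * log_p_lm


def best_label(asr_scores, lm_scores, beta=0.3):
    """Pick the label with the highest fused score."""
    fused = {
        lab: shallow_fusion(asr_scores[lab], lm_scores[lab], beta)
        for lab in asr_scores
    }
    return max(fused, key=fused.get)


# Acoustically similar candidates; the language model prefers "there".
asr_scores = {"there": math.log(0.45), "their": math.log(0.55)}
lm_scores = {"there": math.log(0.9), "their": math.log(0.1)}

print(best_label(asr_scores, lm_scores, beta=0.0))  # their (ASR score only)
print(best_label(asr_scores, lm_scores, beta=0.3))  # there (with LM fusion)
```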

5 ASR setup in adverse environments

Although most ASR recipes supported in ESPnet are standard English tasks, current ESPnet recipes also cover other languages, including Japanese (CSJ), Mandarin Chinese (HKUST CTS), and several European languages through VoxForge. With these various recipes, ESPnet can also realize a multilingual end-to-end ASR system (e.g., 10 languages) by following our previous study. In addition, the ESPnet recipes include noise-robust/far-field speech recognition tasks such as AMI, CHiME-4, and CHiME-5. In particular, ESPnet is an official end-to-end ASR baseline for the CHiME-5 challenge.

Implementation

Figure 2 shows a flow of standard recipes in ESPnet. The recipe is significantly simplified thanks to the benefit of end-to-end ASR, e.g., it does not have to include lexicon preparation, finite state transducer (FST) compilation, training/alignment based on HMM and Gaussian mixture modeling, and lattice generation for sequence discriminative training.

The standard recipe includes the following 6 stages in run.sh (several recipes, including AMI, Librispeech, TED-LIUM, and VoxForge, have an additional data downloading stage, stage -1):

Stage 0 (data preparation): We adopt the Kaldi data directory format, and we can simply use the Kaldi data preparation script (e.g., data_prep.sh).

Stage 1 (feature extraction): Again, we use the Kaldi feature extraction. Most recipes use the 80-dimensional log Mel feature with the pitch feature (83 dimensions in total).

Stage 2 (data preparation for ESPnet): This stage converts all the information included in the Kaldi data directory (transcriptions, speaker and language IDs, and input and output lengths), except for the input features, to one JSON file (data.json).

Stage 3 (language model training): A character-based RNNLM is trained by using either the Chainer or PyTorch backend. This is an optional stage, and several recipes do not have it.

Stage 4 (end-to-end ASR training): A hybrid CTC/attention-based encoder-decoder is trained by using either the Chainer or PyTorch backend.

Stage 5 (recognition): Speech recognition is performed by using the RNNLM and the end-to-end ASR model obtained in stages 3 and 4, respectively.
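The conversion from a Kaldi data directory to a single JSON file can be sketched as follows. The input follows the Kaldi `text` ("utterance-id transcript") and `utt2spk` ("utterance-id speaker-id") line formats, but the output layout, the field names, and the function name are hypothetical simplifications and should not be read as the actual schema of data.json.

```python
import json


def kaldi_dir_to_json(text_lines, utt2spk_lines):
    """Merge Kaldi-style `text` and `utt2spk` entries into one JSON string.

    The output field names are illustrative only.
    """
    utt2spk = {}
    for line in utt2spk_lines:
        utt_id, speaker = line.split()
        utt2spk[utt_id] = speaker

    utts = {}
    for line in text_lines:
        utt_id, _, transcript = line.strip().partition(" ")
        # Character-level targets, with word boundaries made explicit.
        tokens = ["<space>" if ch == " " else ch for ch in transcript]
        utts[utt_id] = {
            "speaker": utt2spk[utt_id],
            "text": transcript,
            "tokens": tokens,
            "output_length": len(tokens),
        }
    return json.dumps({"utts": utts}, indent=2, sort_keys=True)


text = ["utt1 HELLO WORLD", "utt2 GOOD MORNING"]
utt2spk = ["utt1 spkA", "utt2 spkB"]
print(kaldi_dir_to_json(text, utt2spk))
```

Keeping all per-utterance metadata in one file means the training and recognition scripts need only this file plus the feature archives, rather than the full set of Kaldi directory files.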

2 Code lines

In addition to simplifying the actual experimental stages, ESPnet also reduces the amount of source code. Table 1 compares the main source code of Kaldi, Julius, and ESPnet. ESPnet realizes speech recognition, including trainer and recognizer functions, with only 5K lines of Python code, far fewer than Kaldi and Julius, thanks to the simplification brought by end-to-end ASR and the use of Chainer or PyTorch for neural network backends and Kaldi for data preparation and feature extraction. (Since Kaldi and Julius have various functions, including online real-time modes and Windows interfaces, that ESPnet lacks, the source code line counts cannot be compared directly.)

One of the most simplified modules is the model representation part, since it does not have to explicitly represent a complicated speech recognition hierarchy from speech features, HMM states, and context-dependent phonemes to lexicons and words. This hierarchy is represented by a single neural network with at most a thousand lines of Python code. This in turn simplifies the recognition module to at most five hundred lines, as it is realized by a simple output-synchronous beam search.

Experiments

This section discusses the experimental results of our three main tasks: WSJ, CSJ, and HKUST. The first experiment shows the effectiveness of ESPnet on the well-known WSJ task using several experimental configurations, and also compares the results with other reports on the same task within an end-to-end ASR framework. The other experiments compare the performance of ESPnet with state-of-the-art ASR systems for the CSJ and HKUST tasks. The main reason for choosing these two languages is that these ideogram languages have relatively shorter letter sequences than alphabet languages, which greatly reduces the computational complexity and makes it easy to handle context information in a decoder network. Indeed, our prior investigation shows that Japanese and Mandarin Chinese end-to-end ASR can be easily scaled up and shows reasonable performance without using the various tricks developed for large-scale English tasks.

Table 2 compares the performance of ESPnet with different techniques in the WSJ task. The use of a deeper encoder network, integration of a character-based LSTMLM, and joint CTC/attention decoding steadily improved the performance. Table 2 also compares the results of ESPnet with other reports. Since these reports are based on different conditions (e.g., some do not use any language model, while others use a word-based language model through FST), we cannot directly compare them. However, the comparison with these prior studies indicates that ESPnet provides reasonable performance. Table 2 also provides the computational time for the main end-to-end ASR network training together with the number of GPUs. ESPnet achieved very fast training, especially for the PyTorch backend, even with a single GPU (GTX 1080 Ti), compared with a previously reported system for the same WSJ task.

However, one of the issues of these end-to-end ASR systems is that their performance does not reach that of state-of-the-art hybrid HMM/DNN systems. For example, the WER of hybrid HMM/DNN systems for the WSJ task is below 5%, and this gap probably comes from the limited amount of training data. Indeed, some studies report performance comparable or superior to state-of-the-art hybrid HMM/DNN systems in very large English tasks, although such results are out of reach for many research groups due to the lack of computational resources. Therefore, scaling up the English task while keeping computational requirements low, or improving the performance by mitigating the data sparseness issue, is one of our important future studies.

Compared with the English tasks, end-to-end ASR systems can easily achieve performance comparable to state-of-the-art hybrid HMM/DNN systems in the Japanese and Mandarin Chinese tasks. Note that ESPnet does not use lexical information (a pronunciation dictionary and morphological analyzer), which is an essential component of the HMM/DNN and CTC-syllable systems. Tables 3 and 4 compare the best system of ESPnet (i.e., VGG2-BLSTM, char-RNNLM, and joint decoding) with the hybrid HMM/DNN systems. In particular, ESPnet almost reached the latest best performance of the HMM/DNN system with lattice-free MMI training in the HKUST task.

Conclusions

This paper introduced a new end-to-end ASR toolkit named ESPnet. ESPnet fully utilizes dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines, and greatly simplifies training and recognition of the whole ASR pipeline. A number of experiments and comparisons with other reports show that ESPnet achieves reasonable ASR performance and also reaches performance comparable to state-of-the-art HMM/DNN systems with a legacy setup. ESPnet is under active development, and a multi-GPU function, data augmentation, a multihead decoder, multichannel end-to-end ASR, and Babel multilingual ASR experiments are in preparation. In particular, with the multi-GPU function (5 GPUs), ESPnet finished training on the 581 hours of the CSJ task in only 26 hours.