EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding
Yajie Miao, Mohammad Gowayyed, Florian Metze
Introduction
Automatic speech recognition (ASR) has traditionally leveraged the hidden Markov model/Gaussian mixture model (HMM/GMM) paradigm for acoustic modeling. HMMs act to normalize the temporal variability, whereas GMMs compute the emission probabilities of HMM states. In recent years, the performance of ASR has been improved dramatically by the introduction of deep neural networks (DNNs) as acoustic models . In the hybrid HMM/DNN approach, DNNs are used to classify speech frames into clustered context-dependent (CD) states (i.e., senones). On a variety of ASR tasks, DNN models have shown significant gains over the GMM models. Despite these advances, building a state-of-the-art ASR system remains a complicated, expertise-intensive task. First, acoustic modeling typically requires various resources such as dictionaries and phonetic questions. Under certain conditions (e.g., in low-resource languages), these resources may be unavailable, which restricts or delays the deployment of ASR. Second, in the hybrid approach, training of DNNs still relies on GMM models to obtain (initial) frame-level labels. Building GMM models normally goes through multiple stages (e.g., CI phone, CD states, etc.), and every stage involves different feature processing techniques (e.g., LDA, fMLLR, etc.). Third, the development of ASR systems highly relies on ASR experts to determine the optimal configurations of a multitude of hyper-parameters, for instance, the number of senones and Gaussians in the GMM models.
Previous work has made various attempts to reduce the complexity of ASR. In , researchers propose to flat-start DNNs and thus get ride of GMM models. However, this GMM-free approach still requires iterative procedures such as generating forced alignments and decision trees. Meanwhile, another line of work has focused on end-to-end ASR, i.e., modeling the mapping between speech and labels (words, phonemes, etc.) directly without any intermediate components (e.g., GMMs). On this aspect, Graves et al. introduce the connectionist temporal classification (CTC) objective function to infer speech-label alignments automatically. This CTC technique is further investigated in on large-scale acoustic modeling tasks. Although showing promising results, research on end-to-end ASR faces two major obstacles. First, it is challenging to incorporate lexicons and language models into decoding. When decoding CTC-trained models, past work has successfully constrained search paths with lexicons. However, how to integrate word-level language models efficiently still is an unanswered question . Second, the community lacks a shared experimental platform for the purpose of benchmarking. End-to-end systems described in the literature differ not only in their model architectures but also in their decoding methods. For example, and adopt two distinct versions of beam search for decoding CTC models. These setup variations hamper rigorous comparisons not only across end-to-end systems, but also between the end-to-end and existing hybrid approaches.
In this paper, we resolve these issues by presenting and publicly releasing our Eesen framework. Acoustic modeling in Eesen is viewed as a sequence-to-sequence learning problem. We exploit deep recurrent neural networks (RNNs) as the acoustic models, and the Long Short-Term Memory (LSTM) units as the RNN building blocks. Using the CTC objective function, Eesen simplifies acoustic modeling into learning a single RNN over pairs of speech and context-independent (CI) label sequences. A distinctive feature of Eesen is a generalized decoding method based on weighted finite-state transducers (WFSTs). In this method, individual components (CTC labels, lexicons and language models) are encoded into WFSTs, and then composed into a comprehensive search graph. The WFST representation provides a convenient way of handling the CTC blank label and enabling beam search during decoding. Our experiments with the Wall Street Journal (WSJ) benchmark show that Eesen results in superior performance than the existing end-to-end ASR pipelines . The WERs of Eesen are on a par with strong hybrid HMM/DNN baselines. Moreover, the application of CI modeling targets allows Eesen to speed up decoding and reduce decoding memory usage. Eesen is released as an open-source project111https://github.com/yajiemiao/eesen, and will undertake continuous expansion and optimization.
The EESEN Framework: Model Training
Acoustic models in Eesen are deep bidirectional RNNs trained with the CTC objective function . We describe the model structure in Section 2.1, and restate key points of CTC training in Section 2.2. Section 2.3 presents some practical considerations emerging from our GPU implementation.
Compared to the standard feedforward networks, RNNs have the advantage of learning complex temporal dynamics on sequences. Given an input sequence , a recurrent layer computes the forward sequence of hidden states by iterating from to :
subscript→Wℎ𝑥subscriptx𝑡subscript→Wℎℎsubscript→h𝑡1subscript→bℎ\overrightarrow{\textbf{h}}_{t}=\sigma(\overrightarrow{\textbf{W}}_{hx}\textbf{x}_{t}+\overrightarrow{\textbf{W}}_{hh}\overrightarrow{\textbf{h}}_{t-1}+\overrightarrow{\textbf{b}}_{h}) (1) where is the input-to-hidden weight matrix, is the hidden-to-hidden weight matrix. In addition to the inputs , the hidden activation from the previous time step are fed to influence the hidden outputs at the current time step. In a bidirectional RNN, an additional recurrent layer computes the backward sequence of hidden outputs from to :
subscript←Wℎ𝑥subscriptx𝑡subscript←Wℎℎsubscript←h𝑡1subscript←bℎ\overleftarrow{\textbf{h}}_{t}=\sigma(\overleftarrow{\textbf{W}}_{hx}\textbf{x}_{t}+\overleftarrow{\textbf{W}}_{hh}\overleftarrow{\textbf{h}}_{t-1}+\overleftarrow{\textbf{b}}_{h}) (2) Our acoustic model is a deep architecture, in which we stack multiple bidirectional recurrent layers. At each frame , the concatenation of the forward and backward hidden outputs from the current layer are treated as inputs into the next recurrent layer.
Learning of RNNs can be done using back-propagation through time (BPTT). In practice, training RNNs to learn long-term temporal dependency can be difficult due to the vanishing gradients problem . To overcome this issue, we apply the LSTM units as the building blocks of RNNs. LSTM contains memory cells with self-connections to store the temporal states of the network. Also, multiplicative gates are added to control the flow of information. Fig. 1 depicts the structure of the LSTM units we use. The blue curves represent peephole connections that link the memory cells to the gates to learn precise timing of the outputs.
The computation at the time step can be formally written as follows. We omit the arrow for uncluttered formulation.
subscriptW𝑖𝑥subscriptx𝑡subscriptW𝑖ℎsubscripth𝑡1subscriptW𝑖𝑐subscriptc𝑡1subscriptb𝑖\displaystyle\textbf{i}_{t}=\sigma(\textbf{W}_{ix}\textbf{x}_{t}+\textbf{W}_{ih}\textbf{h}_{t-1}+\textbf{W}_{ic}\textbf{c}_{t-1}+\textbf{b}_{i}) (3a) (3b) (3c) (3d) (3e) where , , , are the activation of the input gates, output gates, forget gates and memory cells respectively. The weight matrices connect the inputs with the units, whereas the matrices connect the previous hidden states with the units. The terms are diagonal weight matrices for peephole connections. Also, is the logistic sigmoid nonlinearity, and is the hyperbolic tangent nonlinearity. The computation of the backward LSTM layer can be represented similarly. In this work, we use a purely LSTM-based architecture as the acoustic model. However, combing LSTMs with other network structures, e.g., time-delay or convolutional neural networks , is straightforward to achieve.
2 Training with Connectionist Temporal Classification
Unlike in the hybrid approach, the RNN model in our Eesen framework is not trained using frame-level labels with respect to the cross-entropy (CE) criterion. Instead, following , we adopt the CTC objective to automatically learn the alignments between speech frames and their label sequences (e.g., phonemes or characters). Assume that the label sequences in the training data contain unique labels. Normally is a relatively small number, e.g., around 45 for English when the labels are phonemes. An additional blank label , which means no labels being emitted, is added to the labels. For simplicity of formulation, we denote every label using its index in the label set. Given an utterance , its label sequence is denoted as . In our implementation, we always index the blank as 0. Therefore is an integer ranging from 1 to . The length of z is constrained to be no greater than the length of the utterance, i.e., . CTC aims to maximize , the log-likelihood of the label sequence given the inputs, by optimizing the RNN model parameters.
The final layer of the RNN is a softmax layer which has nodes that correspond to the labels (including ). At each frame , we get the output vector whose -th element is the posterior probability of the label . However, since the labels z are not aligned to the frames, it is difficult to evaluate the likelihood of z given the RNN outputs. To bridge the RNN outputs with label sequences, an intermediate representation, the CTC path, is introduced in . A CTC path is a sequence of labels at the frame level. It differs from z in that the CTC path allows occurrences of the blank label and repetitions of non-blank labels. The total probability of the CTC path is decomposed into the probability of the label at each frame:
The label sequence z can then be mapped to its corresponding CTC paths. This is a one-to-multiple mapping because multiple CTC paths can correspond to the same label sequence. For example, both “A A B C ” and “ A A B C C” are mapped to the label sequence “A B C”. We denote the set of CTC paths for z as . Then, the likelihood of z can be evaluated as a sum of the probabilities of its CTC paths:
However, summing over all the CTC paths is computationally intractable. A solution is to represent the possible CTC paths compactly as a trellis. To allow blanks in CTC paths, we add “0” (the index of ) to the beginning and the end of z, and also insert “0” between every pair of the original labels in z. The resulting augmented label sequence is leveraged in a forward-backward algorithm for efficient likelihood evaluation. Specifically, in a forward pass, the variable represents the total probability of all CTC paths that end with label at frame . As with the case of HMMs , can be recursively computed from and . Similarly, a backward variable carries the total probability of all CTC paths that starts with label at and reaches the final frame . The likelihood of the label sequence z can then be computed as:
2𝑈1superscriptsubscript𝛼𝑡𝑢superscriptsubscript𝛽𝑡𝑢Pr(\textbf{z}|\textbf{X})={\sum_{u=1}^{2U+1}{\alpha_{t}^{u}\beta_{t}^{u}}} (6) where can be any frame . The objective now becomes differentiable with respect to the RNN outputs . We define an operation on the augmented label sequence that returns the elements of l which have the value . The derivative of the objective with respect to can be derived as:
𝑃𝑟conditionalzXsuperscriptsubscript𝑦𝑡𝑘1𝑃𝑟conditionalzX1superscriptsubscript𝑦𝑡𝑘subscript𝑢Υl𝑘superscriptsubscript𝛼𝑡𝑢superscriptsubscript𝛽𝑡𝑢{\partial\ln Pr(\textbf{z}|\textbf{X})\over\partial y_{t}^{k}}={1\over Pr(\textbf{z}|\textbf{X})}{1\over y_{t}^{k}}\sum_{u\in\Upsilon(\textbf{l},k)}\alpha_{t}^{u}\beta_{t}^{u} (7) These errors are back-propagated through the softmax layer and further into the RNN to update the model parameters.
3 GPU Implementation
We implement the training of the RNN models on GPU devices. To fully exploit the capacity of GPUs, multiple utterances are processed at a time in parallel. This parallel processing speeds up model training by replacing matrix-vector multiplication over single frames with matrix-matrix multiplication over multiple frames. Within a group of parallel utterances, we pad every utterance to the length of the longest utterance in the group. These padding frames are excluded from gradients computation and parameter updating. For further acceleration, the training utterances are sorted by their lengths, from the shortest to the longest. The utterances in the same group then have approximately the same length, which minimizes the number of padding frames. To ensure training stability, the gradients of RNN parameters are clipped to the range of .
CTC learning is also expensive because the forward and backward vectors ( and ) have to be computed sequentially, either from to or from to . Like in RNNs, our implementation of CTC also processes multiple utterances at the same time. Moreover, at a specific frame , the elements of (and ) are independent and thus can be computed in parallel.
The EESEN Framework: Decoding
Previous work has introduced a variety of methods to decode CTC-trained models. These methods, however, either fail to integrate word-level language models or achieve the integration under constrained conditions (e.g., n-best list rescoring in ). In this work, we propose a generalized decoding approach based on WFSTs . A WFST is a finite-state acceptor (FSA) in which each transition has an input symbol, an output symbol and a weight. A path through the WFST takes a sequence of input symbols and emits a sequence of output symbols. Our decoding method represents the CTC labels, lexicons and language models as separate WFSTs. Using highly-optimized FST libraries such as OpenFST , we can fuse the WFSTs efficiently into a single search graph. Building of the individual WFSTs is described as follows. Although exemplified in the scenario of English, the same procedures hold for other languages.
Grammar. A grammar WFST encodes the permissible word sequences in a language/domain. The WFST shown in Fig. 2 represents a toy language model which permits two sentences “how are you” and “how is it”. The WFST symbols are the words, and the arc weights are the language model probabilities. With this WFST representation, CTC decoding in principle can leverage any language models that can be converted into WFSTs. Following conventions in the literature , the language model WFST is denoted as .
Lexicon. A lexicon WFST encodes the mapping from sequences of lexicon units to words. Depending on what labels our RNN has modeled, there are two cases to consider. If the labels are phonemes, the lexicon is a standard dictionary as we normally have in the hybrid approach. When the labels are characters, the lexicon simply contains the spellings of the words. A key difference between these two cases is that the spelling lexicon can be easily expanded to include any out-of-vocabulary (OOV) words. In contrast, expansion of the phoneme lexicon is not so straightforward. It relies on some grapheme-to-phoneme rules/models, and is potentially subject to errors. The lexicon WFST is denoted as . Fig. 3 and 4 illustrate these two cases of building .
For the spelling lexicon, there is another complication to deal with. With characters as CTC labels, we usually insert an additional space character between every pair of words, in order to model word delimiting in the original transcripts. During decoding, we allow the space character to optionally appear at the beginning and end of a word. This complication can be handled easily by the WFST shown in Fig. 4.
Token. The third WFST component maps a sequence of frame-level CTC labels to a single lexicon unit (phoneme or character). For a lexicon unit, its token WFST is designed to subsume all of its possible label sequences at the frame level. Therefore, this WFST allows occurrences of the blank label , as well as repetitions of any non-blank lables. For example, after processing 5 frames, the RNN model may generate 3 possible label sequences “AAAAA”, “ A A ”, “ A A A ”. The token WFST maps all these 3 sequences into a singleton lexicon unit “A”. Fig. 5 shows the WFST structure for the phoneme ”IH”. We denote the token WFST as .
Search Graph. After compiling the three individual WFSTs, we compose them into a comprehensive search graph. The lexicon and grammar WFSTs are firstly composed. Two special WFST operations, determinization and minimization, are performed over the composition of them, in order to compress the search space and thus speed up decoding. The resulting WFST is then composed with the token WFST, which finally generates the search graph. Overall the oder of the FST operations is:
where , and denote composition, determinization and minimization respectively. The search graph encodes the mapping from a sequence of CTC labels emitted on speech frames to a sequence of words.
2 Posterior Normalization
When decoding the hybrid DNN models, we need to scale the states posteriors from the DNNs using states priors. The priors are usually estimated from the forced alignments of the training data. During decoding of the CTC-trained models, we adopt a similar procedure. Specifically, we run the final RNN model over the training set for a propagation pass. Labels with the largest posteriors are picked as the frame-level alignments, from which priors of the labels are estimated. However, this method does not perform well in our experiments. Part of the reason is that the softmax-layer outputs from a CTC-trained model display a highly peaky distribution . That is, a majority of the frames have the blank as their labels. The activation of the non-blank labels only appears in a narrow region along the time axis. This causes the prior estimates to be dominated by the count of the blank.
Alternatively, we propose to estimate more robust label priors from the label sequences in the training data. As mentioned in Section 2.2, the label sequences actually used by CTC training are the augmented label sequences, which insert the blank at the beginning, at the end, and between every label pair in the original label sequences. We compute the priors from the augmented label sequences (e.g., ” IH Z ”), instead of the original ones (e.g., ”IH Z”), through simple counting. In our experiments, this simple method gives better recognition accuracy than both the aforementioned frame-alignment method and also the proposal described in .
Experiments
The experiments are conducted on the Wall Street Journal (WSJ) corpus that can be obtained from LDC under the catalog numbers LDC93S6B and LDC94S13B. Data preparation gives us 81 hours of transcribed speech, from which we select 95% as the training set and the remaining 5% for cross validation. As discussed in Section 2, we apply deep RNNs as the acoustic models. Inputs of the RNNs are 40-dimensional filterbank features together with their first and second-order derivatives. The features are normalized via mean subtraction and variance normalization on the speaker basis.
Initial values of the model parameters are randomly drawn from a uniform distribution with the range [0.1, 0.1]. The model is trained with BPTT, in which the errors are back-propagated from CTC. Utterances in the training set are sorted by their lengths, and 10 utterances are processed in parallel at a time. The error rate of the hypothesized labels is monitored to determine learning rates. The hypothesized labels are formed by firstly picking the frame-level labels (the label with the largest probability at every frame), and then removing blanks and label repetitions. The label error rate (LER) can be obtained in the same manner as WER, i.e., computing the edit distance between the hypothesized labels and the reference. We adopt a decaying “newbob” learning rate schedule based on LERs. Specifically, the learning rate starts from 0.00004 and remains unchanged until the drop of LER on the validation set between two consecutive epochs falls below 0.5%. Then the learning rate is decayed by a factor of 0.5 at each of the subsequent epochs. The whole learning process terminates when the LER fails to decrease by 0.1% between two successive epochs.
Our decoding follows the WFST-based approach in Section 3. After posterior normalization, the acoustic model scores need to be scaled down. The scaling factor lies between 0.5 and 0.9, and its optimal value is decided empirically. We apply the WSJ standard pruned trigram language model in the ARPA format (which we will consistently refer to as standard). To be consistent with previous work , we report our results on the eval92 set. Our experimental setup has been released together with Eesen, which enables the readers to reproduce our numbers easily.
2 Phoneme-based Systems
We explore the optimal RNN configurations on the phoneme-based systems. When phonemes are taken as CTC labels, we employ the CMU dictionary222http://www.speech.cs.cmu.edu/cgi-bin/cmudict as the lexicon. Due to the lack of forced alignments, CTC training cannot handle multiple pronunciations for the same word. For every word, we only keep its first pronunciation in the lexicon and remove all the other alternatives. From the lexicon, we extract 72 labels including phonemes, noise marks and the blank. Our best-performing model has 4 bi-directional LSTM layers. At each layer, both the forward and the backward sub-layers contain 320 memory cells. Model training ends up to reach the LER (phone error rate in this setting) of 8.8% on the validation set. On the eval92 testing set, the Eesen end-to-end system finally achieves the WER of 7.87%, with both the lexicon and the language model used in decoding. When only the lexicon is used, our decoding behaves similarly as the beam search in . In this case, the WER rises quickly to 26.92%. This obvious degradation reveals the effectiveness of our decoding approach in integrating language models.
Table 1 shows a comparison between Eesen and a hybrid HMM/DNN system. The hybrid system is constructed by following the standard Kaldi recipe “s5” . Inputs of the DNN model are 11 neighboring frames of filterbank features. The DNN has 6 hidden layers and 1024 units at each layer. This DNN model contains slightly more parameters (9.2 vs 8.5 million) than the Eesen RNN model. Parameters of the DNN are initialized with restricted Boltzmann machines (RBMs) that are pre-trained in a greedy layerwise fashion . The DNN is fine-tuned to optimize the CE objective with respect to 3421 senones. For fair evaluations, we decode the DNN model using the original lexicon, rather than the expanded lexicon used by the Kaldi recipe. From Table 1, we observe that the performance of the Eesen system is still behind the hybrid HMM/DNN system. Our recent developments of Eesen reveal that CTC-trained models outperform the existing hybrid systems on large-sized datasets, e.g., Switchboard. Interested readers may refer to the Eesen repository for the updates.
A major advantage of Eesen compared with the hybrid approach is the decoding speed. The acceleration comes from the drastic reduction of the number of states, i.e., from thousands of senones to tens of phonemes/characters. To verify this, Table 2 compares the decoding speed of the Eesen and the hybrid HMM/DNN systems under their best decoding settings. From their real-time factors, we observe that decoding in Eesen is 3.2 faster than that of HMM/DNN. Also, the decoding graph (TLG) in Eesen is significantly smaller than the graph (HCLG) used by HMM/DNN, which saves the disk space for storing the graphs.
Model LM #Param WER% Eesen RNN lexicon 8.5M 26.92 Eesen RNN trigram 8.5M 7.87 Hybrid HMM/DNN trigram 9.2M 7.14
Model RTF Graph Size Eesen RNN 0.64 263 Hybrid HMM/DNN 2.06 480
3 Character-based Systems
We apply the same RNN architecture discussed in Section 4.2 to modeling characters. We take the word list from the CMU dictionary as our vocabulary, ignoring the word pronunciations. CTC training deals with 59 labels including letters, digits, punctuation marks, etc. Table 3 shows that with the standard language model, the character-based system gets the WER of 9.07%.
CTC experiments in past work have adopted an expanded vocabulary, and re-trained the language model using text data released together with the WSJ corpus. For fair comparison, we follow the identical configuration. OOV words that occur at least twice in the language model training texts are added to the vocabulary. A new trigram language model is built (and then pruned) with the language model training texts. Under this setup, the WER of the Eesen character-based system is reduced to 7.34%.
Table 3 lists the results of end-to-end ASR systems that have been reported in the previous work and on the same dataset. Our Eesen framework outperforms both and in terms of WERs on the testing set. It is worth pointing out that the 8.7% WER reported in is obtained not in a purely end-to-end manner. Instead, the authors of generate a n-best list of hypotheses from a hybrid DNN model, and apply the CTC model to rescore the hypotheses candidates. Our Eesen numbers, in contrast, come from a completely end-to-end pipeline, without any intervention from GMM or hybrid DNN models.
System Vocabulary Language Model WER% Eesen Original Standard 9.07 Eesen Expanded Re-trained 7.34 Graves et al. Expanded Re-trained 8.7 Hannun et al. Original Unknown 14.1
Conclusions and Future Work
In this work, we present our Eesen framework to build end-to-end ASR systems. Eesen exploits deep RNNs as the acoustic models and CTC as the training objective function. We train the RNN models in a single step, and thus are able to reduce the complexity of ASR system development. The WFST-based decoding enables efficient and effective incorporation of lexicions and language models. Because of its open-source property, Eesen can serve as a shared benchmark platform for research on end-to-end ASR.
In our future work, we plan to further improve the WERs of Eesen systems via more advanced learning techniques (e.g., expected transcription loss in ) and alternative decoding approach (e.g., dynamic decoders ). Also, we are interested to apply Eesen to various languages and different types (e.g., noisy, far-field) of speech, and investigate how end-to-end ASR performs under these conditions. Moreover, due to the removal of GMMs, acoustic modeling in Eesen cannot leverage speaker adapted front-ends. We will study new speaker adaptation and adaptive training techniques for the CTC models.
Acknowledgements
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. This research was performed as part of the Speech Recognition Virtual Kitchen project, which is supported by the United States National Science Foundation under grant number CNS-1305365. This work was partially funded by Facebook, Inc. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Facebook, Inc.