The PyTorch-Kaldi Speech Recognition Toolkit
Mirco Ravanelli, Titouan Parcollet, Yoshua Bengio
Introduction
Over the last years, we witnessed a progressive improvement and maturation of Automatic Speech Recognition (ASR) technologies , that have reached unprecedented performance levels and are nowadays used by millions of users worldwide.
A key role in this technological breakthrough is being played by deep learning , that contributed to overcoming previous speech recognizers based on Gaussian Mixture Models (GMMs). Beyond deep learning, other factors have played a role in the progress of the field. A number of speech-related projects such as AMI and DIRHA and speech recognition challenges such as CHiME , Babel, and Aspire, have remarkably fostered the progress in ASR. The public distribution of large datasets such as Librispeech has also played an important role to establish common evaluation frameworks and tasks.
Among the others factors, the development of open-source software such as HTK , Julius , CMU-Sphinx, RWTH-ASR , LIA-ASR and, more recently, the Kaldi toolkit have further helped popularize ASR, making both research and development of novel ASR applications significantly easier.
Kaldi currently represents the most popular ASR toolkit. It relies on finite-state transducers (FSTs) and provides a set of C++ libraries for efficiently implementing state-of-the-art speech recognition systems. Moreover, the toolkit includes a large set of recipes that cover all the most popular speech corpora. In parallel to the development of this ASR-specific software, several general-purpose deep learning frameworks, such as Theano , TensorFlow , and CNTK , have gained popularity in the machine learning community. These toolkits offer a huge flexibility in the neural network design and can be used for a variety of deep learning applications.
PyTorch is an emerging python package that implements efficient GPU-based tensor computations and facilitates the design of neural architectures, thanks to proper routines for automatic gradient computation. An interesting feature of PyTorch lies in its modern and flexible design, that naturally supports dynamic neural networks. In fact, the computational graph is dynamically constructed on-the-fly at running time rather than being statically compiled.
The PyTorch-Kaldi project aims to bridge the gap between Kaldi and PyTorchgithub.com/mravanelli/pytorch-kaldi/.. Our toolkit implements acoustic models in PyTorch, while feature extraction, label/alignment computation, and decoding are performed with Kaldi, making it suitable to develop state-of-the-art DNN-HMM speech recognizers. PyTorch-Kaldi natively supports several DNNs, CNNs, and RNNs models. Combinations between deep learning models, acoustic features, and labels are also supported, enabling the use of complex neural architectures. For instance, users can employ a cascade between CNNs, LSTMs, and DNNs, or run in parallel several models that share some hidden layers. Users can also explore different acoustic features, context duration, neuron activations (e.g., ReLU, leaky ReLU), normalizations (e.g., batch and layer normalization ), cost functions, regularization strategies (e.g, L2, dropout ), optimization algorithms (e.g., Adam , RMSPROP), and many other hyper-parameters of an ASR system through simple edits of configuration files.
The toolkit is designed to make the integration of user-defined acoustic models as simple as possible. In practice, users can embed their deep learning model and conduct ASR experiments even without being fully familiar with the complex speech recognition pipeline. The toolkit can perform computations on both local machines and HPC cluster, and supports multi-gpu training, recovery strategy, and automatic data chunking.
The experiments, conducted on several datasets and tasks, have shown that PyTorch-Kaldi makes it possible to easily develop competitive state-of-the-art speech recognition systems.
The PyTorch-Kaldi Project
Some other speech recognition toolkits have been recently developed using the python language. PyKaldi , for instance, is an easy-to-use Python wrapper for the C++ code of Kaldi and OpenFst libraries. Differently from our toolkit, however, the current version of the PyKaldi does not provide several pre-implemented and ready-to-use neural models. Another python project is ESPnet . ESPnet is an end-to-end speech processing toolkit, mainly focuses on end-to-end speech recognition and end-to-end text-to-speech. The main difference with our project is the current version of PyTorch-Kaldi implements hybrid DNN-HMM speech recognizers.
An overview of the architecture adopted in PyTorch-Kaldi is reported in Fig. 1. The main script is written in python and manages all the phases involved in an ASR system, including feature and label extraction, training, validation, decoding, and scoring. The toolkit is detailed in the following sub-sections.
The main script takes as input a configuration file in INI formatThe configuration file is fully described in the project documentation., that is composed of several sections. The section specifies some high-level information such as the folder used for the experiment, the number of training epochs, the random seed. It also allows users to specify whether the experiments have to be conducted on a CPU, GPU, or on multiple GPUs. The configuration file continues with the sections, that specify information on features and labels, including the paths where they are stored, the characteristics of the context window , and the number of chunks in which the speech dataset must be split. The neural models are described in the sections, while the section defines how these neural networks are combined. The latter section exploits a simple meta-language that is automatically interpreted by the script. Finally, the configuration file defines the decoding parameters in the section.
2 Features
The feature extraction is performed with Kaldi, that natively provides c++ libraries (e.g., compute-mfcc-feats, compute-fbank-feats, compute-plp-feats) to efficiently extract the most popular speech recognition features. The computed coefficients are stored in binary archives (with extension .ark) and are later imported into the python environment using the kaldi-io utilities inherited from the kaldi-io-for-python projectgithub.com/vesis84/kaldi-io-for-python. The features are then processed by the function load-chunk, that performs context window composition, shuffling, as well as mean and variance normalization. As outlined before, PyTorch-Kaldi can manage multiple feature streams. For instance, users can define models that exploit combinations of MFCCs, FBANKs, PLP, and fMLLR coefficients.
3 Labels
The main labels used for training the acoustic model derive from a forced alignment procedure between the speech features and the sequence of context-dependent phone states computed by Kaldi with a phonetic decision tree. To enable multi-task learning, PyTorch-Kaldi supports multiple labels. For instance, it is possible to jointly load both context-dependent and context-independent targets and use the latter ones to perform monophone regularization . It is also possible to employ models based on an ecosystem of neural networks performing different tasks, as done in the context of joint training between speech enhancement and speech recognition or in the context of the recently-proposed cooperative networks of deep neural networks .
4 Chunk and Mini-batch Composition
PyTorch-Kaldi automatically splits the full dataset into a number of chunks, which are composed of labels and features randomly sampled from the full corpus. Each chunk is then stored into the GPU or CPU memory and processed by the neural training algorithm . The toolkit dynamically composes different chunks at each epoch. A set of mini-batches are then derived from them. Mini-batches are composed of few training examples that are used for gradient computation and parameter optimization.
The way mini-batches are gathered strongly depends on the typology of the neural network. For feed-forward models, the mini-batches are composed of randomly shuffled features and labels sampled from the chunk. For recurrent networks, the minibatches must be composed of full sentences. Different sentences, however, are likely to have different duration, making zero-padding necessary to form mini-batches of the same size. PyTorch-Kaldi sorts the speech sequences in ascending order according to their lengths (i.e., short sentences are processed first). This approach minimizes the need of zero-paddings and turned out to be helpful to avoid possible biases on batch normalization statistics. Moreover, it has been shown useful to slightly boost the performance and to improve the numerical stability of gradients.
5 DNN acoustic modeling
Each minibatch is processed by a neural network implemented with PyTorch, that takes as input the features and as outputs a set of posterior probabilities over the context-dependent phone states. The code is designed to easily plug-in customized models. As reported in the pseudo-code reported in Fig. 2, the new model can be simply defined by adding a new class into the neural_nets.py. The class must be composed of an initialization method, that specifies the parameters with their initialization, and a forward method that defines the computations to perform.
As an alternative, a number of pre-defined state-of-the-art neural models are natively implemented within the toolkit. The current version supports standard MLPs, CNNs, RNNs, LSTM, and GRU models. Moreover, it supports some advanced recurrent architectures, such as the recently-proposed Light GRU and twin-regularized RNNs . The SincNet model is also implemented to perform speech recognition from raw waveform directly. The hyperparameters of the model (such as learning rate, number of neurons, number of layers, dropout factor, etc.) can be tuned using a utility that implements the random search algorithm .
6 Decoding and Scoring
The acoustic posterior probabilities generated by the neural network are normalized by their prior before feeding the HMM-based decoder of Kaldi. The decoder merges the acoustic scores with the language probabilities derived by an n-gram language model and tries to retrieve the sequence of words uttered in the speech signal using a beam-search algorithm. The final Word-Error-Rate (WER) score is computed with the NIST SCTK scoring toolkit.
Experimental Setup
In the following sub-sections, the corpora, and the DNN setting adopted for the experimental activity are described.
The first set of experiments was performed with the TIMIT corpus, considering the standard phoneme recognition task (aligned with the Kaldi s5 recipe ).
To validate our model in a more challenging scenario, experiments were also conducted in distant-talking conditions with the DIRHA-English datasetThis dataset is distributed by the Linguistic Data Consortium (LDC). . Training was based on the original WSJ-5k corpus (consisting of sentences uttered by speakers) that was contaminated with a set of impulse responses measured in a domestic environment . The test phase was carried out with the real part of the dataset, consisting of WSJ sentences uttered in the aforementioned environment by six native American speakers.
Additional experiments were conducted with the CHiME 4 dataset , that is based on speech data recorded in four noisy environments (on a bus, cafe, pedestrian area, and street junction). The training set is composed of noisy WSJ sentences recorded by five microphones (arranged on a tablet) and uttered by a total of speakers. The test set ET-real considered in this work is based on real sentences uttered by four speakers, while the subset DT-real has been used for hyperparameter tuning. The CHiME experiments were based on the single channel setting .
Finally, experiments were performed with the LibriSpeech dataset. We used the training subset composed of hours and the dev-clean set for the hyperparameter search. Test results are reported on the test-clean part using the fglarge decoding graph inherited from the Kaldi s5 recipe.
2 DNN setting
The experiments consider different acoustic features, i.e., MFCCs ( static++), log-mel filter-bank features (FBANKS), as well as fMLLR features (extracted as reported in the s5 recipe of Kaldi), that were computed using windows of ms with an overlap of ms. The feed-forward models were initialized according to the Glorot’s scheme , while recurrent weights were initialized with orthogonal matrices . Recurrent dropout was used as a regularization technique . Batch normalization was adopted for feed-forward connections only, as proposed in . The optimization was done using the RMSprop algorithm running for epochs. The performance on the development set was monitored after each epoch and the learning rate was halved when the relative performance improvement went below . The main hyperparameters of the model (i.e., learning rate, number of hidden layers, hidden neurons per layer, dropout factor, as well as the twin regularization term ) were tuned on the development datasets.
Baselines
In this section, we discuss the baselines obtained with TIMIT, DIRHA, CHiME, and LibriSpeech datasets. As a showcase to illustrate the main functionalities of the PyTorch-Kaldi toolkit, we first report the experimental validation conducted on TIMIT.
Table 1 shows the performance obtained with several feed-forward and recurrent models using different features. To ensure a more accurate comparison between the architectures, five experiments varying the initialization seeds were conducted for each model and feature. The table thus reports the average phone error rates (PER)Standard deviations range between and for all the experiments..
Results show that, as expected, fMLLR features outperform MFCCs and FBANKs coefficients, thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP one, especially when using LSTM, GRU, and Li-GRU architecture, that effectively address gradient vanishing through multiplicative gates. The best result (PER=%) is obtained with the Li-GRU model , that is based on a single gate and thus saves % of the computations over a standard GRU.
Table 2 details the impact of some popular techniques implemented in PyTorch-Kaldi for improving the ASR performance.
The first row (Baseline) reports the performance achieved with a basic recurrent model, where powerful techniques such as dropout and batch normalization are not adopted. The second row highlights the performance gain that is achieved when progressively increasing the sequence length during training. In this case, we started the training by truncating the speech sentence at steps (i.e, approximately 1 second of speech) and we progressively double the maximum sequence duration at every epoch. This simple strategy generally improves the system performance since it encourages the model to first focus on short-term dependencies and learn longer-term ones only at a later stage. The third row shows the improvement achieved when adding recurrent dropout. Similarly to , we applied the same dropout mask for all the time steps to avoid gradient vanishing problems. The fourth line, instead, shows the benefits derived from batch normalization . Finally, the last line shows the performance achieved when also applying monophone regularization . In this case, we employ a multi-task learning strategy by means of two softmax classifiers: the first one estimates context-dependent states, while the second one predicts monophone targets. As observed in , our results confirm that this technique can successfully be used as an effective regularizer.
The experiments discussed so far are based on single neural models. In Table 3 we compare our best Li-GRU system with a more complex architecture based on a combination of feed-forward and recurrent models fed by a concatenation of features. To the best of our knowledge, the PER=% achieved by the latter system yields the best-published performance on the TIMIT test-set.
Previous achievements were based on features computed with Kaldi. However, within PyTorch-Kaldi users can employ their own features. Table 4 shows the results achieved with convolutional models fed by standard FBANKs coefficients or by the raw waveform directly.
The standard CNN based on raw samples performs similarly to the one fed by FBANK features. A performance improvement is observed with SincNet , whose effectiveness in speech recognition is here highlighted for the first time.
We now extend our experimental validation to other datasets. With this regard, Table 5 shows the performance achieved on DIRHA, CHiME, and Librispeech (h) datasets.
The Table consistently shows better performance with the Li-GRU model, confirming our previous achievements on TIMIT. The results on DIRHA and CHiME show the effectiveness of the proposed toolkit also in noisy condition. To give a comparison, the best Kaldi baseline proposed in has a WER(%)=18.1%. An end-to-end system trained with ESPnet reaches a WER(%)=44.99%, confirming how critical is end-to-end speech recognition is challenging acoustic conditions. DIRHA represents another very challenging task, that is characterized by the presence of considerable levels of noise and reverberation. The WER=% obtained on this dataset represents the best performance published so-far on the single-microphone task. Finally, the performance obtained with Librispeech outperforms the corresponding p-norm Kaldi baseline () on the considered hours subset.
Conclusions
This paper described the PyTorch-Kaldi project, a new initiative that aims to bridge the gap between Kaldi and PyTorch. The toolkit is designed to make the development of an ASR system simpler and more flexible, allowing users to easily plug-in their customized acoustic models. PyTorch-Kaldi also supports combinations of neural architectures, features, and labels, allowing users to possibly employ complex ASR pipelines. The experiments have confirmed that PyTorch-Kaldi can achieve state-of-the-art results in some popular speech recognition tasks and datasets.
The current version of the PyTorch-Kaldi is already publicly-available along with a detailed documentation. The project is still in its initial phase and we invite all potential contributors to participate in it. We hope to build a community of developers larger enough to progressively maintain, improve, and expand the functionalities of our current toolkit. In the future, we plan to increase the number of pre-implemented models, support neural language model training/rescoring, sequence discriminative training, online speech recognition, as well end-to-end training.
Acknowledgment
We would like to thank Maurizio Omologo, Enzo Telk, and Antonio Mazzaldi for their helpful comments. This research was enabled in part by support provided by Calcul Québec and Compute Canada.