Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Ying Zhang, Mohammad Pezeshki, Philemon Brakel, Saizheng Zhang, Cesar Laurent, Yoshua Bengio, Aaron Courville

Introduction

Recently, Convolutional Neural Networks (CNNs) have achieved great success in acoustic modeling. In the context of Automatic Speech Recognition, CNNs are usually combined with HMMs/GMMs, like regular Deep Neural Networks (DNNs), which results in a hybrid system. In the typical hybrid system, the neural net is trained to predict frame-level targets obtained from a forced alignment generated by an HMM/GMM system. The temporal modeling and decoding operations are still handled by an HMM, but the posterior state predictions are generated using the neural network.

This hybrid approach is problematic in that training the different modules separately with different criteria may not be optimal for solving the final task. As a consequence, it often requires additional hyperparameter tuning for each training stage, which can be laborious and time consuming. These issues have motivated a recent surge of interest in training ‘end-to-end’ systems. End-to-end neural systems for speech recognition typically replace the HMM with a neural network that provides a distribution over sequences directly. Two popular neural network sequence models are Connectionist Temporal Classification (CTC) and recurrent models for sequence generation.

To the best of our knowledge, all end-to-end neural speech recognition systems employ recurrent neural networks in at least some part of the processing pipeline. The most successful recurrent neural network architecture used in this context is the Long Short-Term Memory (LSTM). For example, a model with multiple layers of bi-directional LSTMs and CTC on top, pre-trained with transducer networks, obtained the state of the art on the TIMIT dataset. After these successes on phoneme recognition, similar systems have been proposed in which multiple layers of RNNs were combined with CTC to perform large vocabulary continuous speech recognition. It seems that RNNs have become somewhat of a default method for end-to-end models, while hybrid systems still tend to rely on feed-forward architectures.

While the results of these RNN-based end-to-end systems are impressive, there are two important downsides to using RNNs/LSTMs: (1) the training speed can be very slow due to the iterative multiplications over time when the input sequence is very long; (2) the training process is sometimes tricky due to the well-known problem of vanishing/exploding gradients. Although various approaches have been proposed to address these issues, such as data/model parallelization across multiple GPUs and careful initialization of the recurrent connections, those models still suffer from computationally intensive and otherwise demanding training procedures.

Inspired by the strengths of both CNNs and CTC, we propose an end-to-end speech framework in which we combine CNNs with CTC without intermediate recurrent layers. We present experiments on the TIMIT dataset and show that such a system is able to obtain results that are comparable to those obtained with multiple layers of LSTMs. The only previous attempt to combine CNNs with CTC that we know of led to results that were far from the state of the art. It is not straightforward to incorporate CNNs into an end-to-end system, since the task may require the model to capture long-term dependencies. While RNNs can learn these kinds of dependencies and have been combined with CTC for this very reason, it was not known whether CNNs were able to learn the required temporal relationships.

In this paper, we argue that in a CNN of sufficient depth, the higher-layer features are capable of capturing temporal dependencies with suitable context information. Using small filter sizes along the spectrogram frequency axis, the model is able to learn fine-grained localized features, while multiple stacked convolutional layers help to learn diverse features on different time/frequency scales and provide the required non-linear modeling capabilities.

Unlike the time windows applied in DNN systems, the temporal modeling is deployed within the convolutional layers, where we perform a 2D convolution similar to vision tasks, and multiple convolutional layers are stacked to provide a relatively large context window for each output prediction of the highest layer. The convolutional layers are followed by multiple fully connected layers and, finally, CTC is added on top of the model. Following a suggestion from prior work, we perform pooling only along the frequency axis and only after the first convolutional layer. We evaluate our model on phoneme recognition using the TIMIT dataset.

Convolutional Neural Networks

Most CNN models in the speech domain use large filters and limited weight sharing, which splits the features into limited frequency bands and performs the convolution separately within each band; the convolution is usually applied with no more than 3 layers. In this section, we describe our CNN acoustic model, whose architecture differs from the above. The complete CNN includes stacked convolutional and pooling layers, on top of which are multiple fully-connected layers.

Since CNNs are adept at modeling local structures in the inputs, we use log mel-filter-bank (plus energy term) coefficients with deltas and delta-deltas, which preserve the local correlations of the spectrogram.

The symbol $*$ denotes the convolution operation and $b_{i}$ is a bias parameter. Three points are worth mentioning: (1) the sequence length $f_{H}$ of $\mathbf{H}$ after convolution is guaranteed to be equal to the sequence length $f$ of the input $\mathbf{X}$ by applying zero padding along the frame axis before each convolution; (2) the convolution stride is chosen to be $1$ for all the convolution operations in our model; (3) we do not use limited weight sharing, which splits the frequency bands into groups of limited bandwidth and performs the convolution within each group separately. Instead, we perform the convolution over $\mathbf{X}$ not only along the frequency axis but also along the time axis, which results in a simple 2D convolution as commonly used in computer vision.
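The padding and stride choices above can be illustrated with a minimal sketch in plain Python. It assumes a single input map and a single filter, an odd filter width along time, and no padding along the frequency axis; the function name `conv2d_same_time` and these simplifications are illustrative assumptions, not the implementation used for the experiments. As in most deep learning libraries, the filter is applied without flipping.

```python
def conv2d_same_time(x, w, b):
    """2D convolution of one input map with one filter, stride 1.

    x: nested list of shape (n_freq, n_frames)
    w: nested list of shape (kf, kt), with kt odd (assumption)
    b: scalar bias
    Zero padding is applied along the frame (time) axis only, so the
    output keeps n_frames columns; frequency is left unpadded (assumption).
    """
    n_freq, n_frames = len(x), len(x[0])
    kf, kt = len(w), len(w[0])
    assert kt % 2 == 1, "this sketch assumes an odd filter width along time"
    pad = kt // 2
    # Zero-pad every frequency row on both ends of the time axis.
    xp = [[0.0] * pad + list(row) + [0.0] * pad for row in x]
    out = []
    for f in range(n_freq - kf + 1):
        row = []
        for t in range(n_frames):
            acc = b
            for df in range(kf):
                for dt in range(kt):
                    acc += w[df][dt] * xp[f + df][t + dt]
            row.append(acc)
        out.append(row)
    return out
```

For a $4\times 6$ input of ones and a $3\times 5$ filter of ones with zero bias, the output has 6 frames per row, and the values near the edges are smaller (9 and 12 instead of 15) because part of the filter covers the zero padding.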

2 Activation Function

The pre-activation feature maps $\mathbf{H}$ are passed through non-linear activation functions. We introduce three activation functions below, using the convolutional layer as an example. Note that all the operations below are element-wise.

Rectifier Linear Unit (ReLU) is a piece-wise linear activation function that outputs zero if the input is negative and outputs the input itself otherwise. Formally, given a single feature map $\mathbf{H}_{i}$, the ReLU output is $\max(0,\mathbf{H}_{i})$, taken element-wise.

2.2 Parametric Rectifier Linear Unit

The Parametric Rectifier Linear Unit (PReLU) is an extension of the ReLU in which the output in the regions where the input is negative is a linear function of the input with slope $\alpha$. Formally, the PReLU output is $\max(0,\mathbf{H}_{i})+\alpha\min(0,\mathbf{H}_{i})$, taken element-wise.

The extra parameter $\alpha$ is usually initialized to 0.1 and can be trained using backpropagation.

2.3 Maxout

Maxout outputs the element-wise maximum $\max(\mathbf{H}^{\prime}_{i},\mathbf{H}^{\prime\prime}_{i})$, where $\mathbf{H}^{\prime}_{i}$ and $\mathbf{H}^{\prime\prime}_{i}$ are two linear feature map candidates after the convolution, each computed with its own filter and bias, and $\mathbf{X}$ is the input of the convolutional layer at $\mathbf{H}_{i}$. Figure 2 depicts the ReLU, PReLU, and Maxout activation functions.
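The three activation functions of this section can be sketched in a few lines of plain Python. The function names, the helper `apply_elementwise`, and the use of nested lists for feature maps are illustrative assumptions; the default `alpha=0.1` follows the initialization mentioned above, and in a trained model it would be a learned parameter.

```python
def relu(h):
    # Zero for negative inputs, identity otherwise.
    return max(0.0, h)


def prelu(h, alpha=0.1):
    # Identity for positive inputs, slope alpha for negative inputs.
    return max(0.0, h) + alpha * min(0.0, h)


def maxout(h1, h2):
    # Maximum over two linear candidates computed from the same input.
    return max(h1, h2)


def apply_elementwise(fn, feature_map):
    # Hypothetical helper: apply a scalar activation to a 2D feature map.
    return [[fn(v) for v in row] for row in feature_map]
```

Maxout takes two pre-activation values per unit, so a maxout layer needs two sets of filters where a ReLU or PReLU layer needs one.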

3 Pooling

Connectionist Temporal Classification

Consider any sequence-to-sequence mapping task in which $\textbf{X}=\{X_{1},...,X_{T}\}$ is the input sequence and $\textbf{Z}=\{Z_{1},\cdots,Z_{L}\}$ is the target sequence. In the case of speech recognition, $\textbf{X}$ is the acoustic signal and $\textbf{Z}$ is a sequence of symbols. In order to train the neural acoustic model, $Pr(\textbf{Z}|\textbf{X})$ must be maximized for each input-output pair.

One way to provide a distribution over variable-length output sequences given some much longer input sequence is to introduce a many-to-one mapping of latent variable sequences $\mathbf{O}=\{O_{1},\cdots,O_{T}\}$ to shorter sequences that serve as the final predictions. The probability of some sequence $\mathbf{Z}$ can then be defined to be the sum of the probabilities of all the latent sequences that map to that sequence. Connectionist Temporal Classification (CTC) specifies a distribution over latent sequences by applying a softmax function to the output of the network for every time step, which provides a probability $Pr(O_{t}|\mathbf{X})$ for emitting each label from the alphabet of output symbols at that time step. An extra blank output class ‘-’ is introduced to the alphabet for the latent sequences to represent the probability of not outputting a symbol at a particular time step. Each latent sequence sampled from this distribution can now be transformed into an output sequence using the many-to-one mapping function $\sigma(\cdot)$, which first merges the repetitions of consecutive non-blank labels to a single label and subsequently removes the blank labels (Equation 7). For example, $\sigma(a,-,a,b,-)=\sigma(a,a,-,-,a,b,b)=(a,a,b)$.
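The mapping $\sigma(\cdot)$ described above can be sketched in plain Python as follows. The function name `collapse` and the representation of labels as characters are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "-"


def collapse(latent, blank=BLANK):
    """Map a latent label sequence to an output sequence.

    Step 1 merges runs of identical consecutive labels into one label.
    Step 2 removes the blank labels.
    """
    merged = [label for label, _ in groupby(latent)]
    return [label for label in merged if label != blank]
```

A blank between two identical labels keeps them apart, so `collapse("a-ab-")` yields two `a` labels followed by `b`, while `collapse("aab")` yields a single `a` followed by `b`.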

Therefore, the final output sequence probability is a summation over all possible sequences $\pi$ that yield $\mathbf{Z}$ after applying the function $\sigma$:
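Written out, and assuming that $\sigma^{-1}(\mathbf{Z})$ denotes the set of latent sequences of length $T$ that $\sigma$ maps to $\mathbf{Z}$ (notation introduced here for illustration), the sum takes the following form. The second equality uses the framewise independence of the latent symbols given the network outputs.

```latex
Pr(\mathbf{Z}|\mathbf{X})
  = \sum_{\pi \in \sigma^{-1}(\mathbf{Z})} Pr(\pi|\mathbf{X})
  = \sum_{\pi \in \sigma^{-1}(\mathbf{Z})} \prod_{t=1}^{T} Pr(O_{t}=\pi_{t}|\mathbf{X})
```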

A dynamic programming algorithm similar to the forward algorithm for HMMs is used to compute the sum in Equation 8 in an efficient way. The intermediate values of this dynamic programming can also be used to compute the gradient of $\ln Pr(\mathbf{Z}|\mathbf{X})$ with respect to the neural network outputs efficiently.
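A minimal sketch of such a forward recursion is given below in plain Python. It follows the standard CTC formulation in which the target is extended with blanks between labels and at both ends; the function name `ctc_forward_prob`, the choice of index 0 for the blank, and the use of raw probabilities instead of log-probabilities are assumptions made for readability. A practical implementation would work in log space for numerical stability.

```python
def ctc_forward_prob(probs, target, blank=0):
    """Return Pr(target | input) under CTC.

    probs:  list of T rows; probs[t][k] is the softmax output for label k
            at time step t.
    target: list of label indices without blanks.
    """
    # Extended target: blank, z1, blank, z2, ..., zL, blank.
    ext = [blank]
    for z in target:
        ext += [z, blank]
    S = len(ext)

    # alpha[s]: total probability of all latent prefixes that end in
    # position s of the extended target at the current time step.
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping the blank in between is allowed only when the
            # two neighbouring labels differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid latent sequences end in the last label or the final blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)
```

The cost of this recursion grows with the product of the input length and the target length, whereas enumerating all latent sequences grows exponentially with the input length.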

To generate predictions from a trained model using CTC, we use the best path decoding algorithm. Since the model assumes that the latent symbols are independent given the network outputs in the framewise case, the latent sequence with the highest probability is simply obtained by emitting the most probable label at each time-step. The predicted sequence is then given by applying $\sigma(\cdot)$ to that latent sequence prediction:
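Best path decoding as described above can be sketched in plain Python: take the most probable label at every time step, then apply the merge-and-remove mapping. The function name `best_path_decode` and the choice of index 0 for the blank label are assumptions made for illustration.

```python
def best_path_decode(probs, blank=0):
    """Greedy CTC decoding.

    probs: list of T rows; probs[t][k] is the softmax output for label k
           at time step t.
    Returns the predicted label sequence without blanks.
    """
    # Most probable label at each time step (the best latent path).
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]

    # Merge consecutive repetitions, then drop blanks.
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```

The result is the output for the single most probable latent sequence, which is not necessarily the most probable output sequence, since several latent sequences can map to the same output.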

Experiments

In this section, we evaluate the proposed model on phoneme recognition for the TIMIT dataset. The model architecture is shown in Figure 3.

We evaluate our models on the TIMIT corpus, where we use the standard 462-speaker training set with all SA records removed. The 50-speaker development set is used for early stopping. The evaluation is performed on the core test set (192 sentences). The raw audio is transformed into 40-dimensional log mel-filter-bank (plus energy term) coefficients with deltas and delta-deltas, which results in 123-dimensional features. Each dimension is normalized to have zero mean and unit variance over the training set. We use 61 phone labels plus a blank label for training, and the output is then mapped to 39 phonemes for scoring.
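The feature dimension and the per-dimension normalization can be checked with a short sketch in plain Python. The function names, the use of the population standard deviation, and the small `eps` guard against division by zero are assumptions made for illustration; the statistics are computed on training frames only and then reused for the development and test sets.

```python
def feature_dim(n_mel=40, n_energy=1, n_streams=3):
    # Static coefficients plus energy, repeated for the static,
    # delta, and delta-delta streams.
    return (n_mel + n_energy) * n_streams


def fit_normalizer(train_frames):
    """Per-dimension mean and standard deviation over training frames."""
    n = len(train_frames)
    dim = len(train_frames[0])
    mean = [sum(f[d] for f in train_frames) / n for d in range(dim)]
    std = [
        (sum((f[d] - mean[d]) ** 2 for f in train_frames) / n) ** 0.5
        for d in range(dim)
    ]
    return mean, std


def normalize(frame, mean, std, eps=1e-8):
    # Apply the training-set statistics to any frame.
    return [(v - m) / (s + eps) for v, m, s in zip(frame, mean, std)]
```

With the default arguments, `feature_dim()` gives $(40+1)\times 3=123$, matching the feature size stated above.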

2 Models

Our best model consists of 10 convolutional layers and 3 fully-connected hidden layers. Unlike the other layers, the first convolutional layer is followed by a pooling layer, which is described in section 2. The pooling size is $3\times 1$, which means we only pool over the frequency axis. The filter size is $3\times 5$ across the layers. The model has 128 feature maps in the first four convolutional layers and 256 feature maps in the remaining six convolutional layers. Each fully-connected layer has 1024 units. Maxout with 2 piece-wise linear functions is used as the activation function. Some other architectures are also evaluated for comparison; see section 4.4 for more details.
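To give a sense of the context window that this stack provides, the receptive field of a unit in the top convolutional layer can be computed with the usual recurrence over kernel sizes and strides. The sketch below assumes that the $3\times 5$ filter is ordered as frequency $\times$ time, consistent with the $3\times 1$ pooling convention, and that the pooling is non-overlapping (stride 3 along frequency); both are assumptions, as is the function name `receptive_field`.

```python
def receptive_field(layers):
    """Receptive field along one axis.

    layers: list of (kernel, stride) pairs ordered from input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field
        jump *= stride              # strides scale later contributions
    return rf


# Time axis: ten convolutions of width 5, stride 1, no pooling.
time_layers = [(5, 1)] * 10

# Frequency axis: conv of height 3, then pooling of size 3 with an
# assumed stride of 3, then nine more convolutions of height 3.
freq_layers = [(3, 1), (3, 3)] + [(3, 1)] * 9
```

Under these assumptions, `receptive_field(time_layers)` is 41 frames and `receptive_field(freq_layers)` is 59 bins, so each top-layer unit sees a context of 41 frames and a frequency span wider than the input itself.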

3 Training and Evaluation

To optimize the model, we use Adam with learning rate $10^{-4}$. Stochastic gradient descent with learning rate $10^{-5}$ is then used for fine-tuning. A batch size of 20 is used during training. The initial weight values were drawn uniformly from the interval $[-0.05,0.05]$. Dropout with a probability of $0.3$ is added across the layers except for the input and output layers. L2-norm regularization with coefficient $10^{-5}$ is applied at the fine-tuning stage. At test time, simple best path decoding (at the CTC frame level) is used to get the predicted sequences.

4 Results

Our model achieves $18.2\%$ phoneme error rate on the core test set, which is slightly better than the LSTM baseline model and the transducer model with an explicit RNN language model. The details are presented in Table 1. Note that the CNN model takes much less time to train than the LSTM model when the number of parameters is kept roughly the same. In our setup on TIMIT, we get $2.5\times$ faster training speed by using the CNN model without deliberately optimizing the implementation. We expect that the gain in computational efficiency might be more dramatic with a larger dataset.

To further investigate the different structural aspects of our model, we split the analysis into three sub-experiments considering the number of convolutional layers, the filter sizes and the activation functions, as shown in Table 1. It turns out that the model may benefit from (1) more layers, which result in more nonlinearities and larger input receptive fields for units in the top layers; (2) reasonably large context windows, which help the model to capture the spatial/temporal relations of input sequences on reasonable time scales; (3) the Maxout unit, which has more functional freedom compared to ReLU and parametric ReLU.

Discussion

Our results showed that convolutional architectures with a CTC cost can achieve results comparable to the state of the art by adopting the following methodology: (1) using a significantly deeper architecture that results in a more non-linear function and also wider receptive fields along both the frequency and temporal axes; (2) using maxout non-linearities in order to make the optimization easier; and (3) careful model regularization that yields better generalization at test time, especially for small datasets such as TIMIT, where over-fitting happens easily.

We conjecture that the convolutional CTC model might be easier to train on phoneme-level sequences than on character-level sequences. Our intuition is that the local structures within the phonemes are more robust and can easily be captured by the model. Additionally, phoneme-level training might not require the modeling of as many long-term dependencies as character-level training. As a result, for a convolutional model, learning the phoneme structure seems to be easier, but empirical research is needed to investigate whether this is indeed the case.

Finally, an important point that favors convolutional over recurrent architectures is the training speed. In a CNN, the training time can be rendered virtually independent of the length of the input sequence due to the parallel nature of convolutional models and the highly optimized CNN libraries available. Computations in a recurrent model are sequential and cannot be easily parallelized. The training time for RNNs increases at least linearly with the length of the input sequence.

Conclusions

In this work, we present a CNN-based end-to-end speech recognition framework without recurrent neural networks, which are widely used in speech recognition tasks. We show promising results on the TIMIT dataset and conclude that the model has the capacity to learn the temporal relations that are required for it to be integrated with CTC. We already observed a gain in computational efficiency on the TIMIT dataset; training the model on large vocabulary datasets and integrating it with a language model will be part of our further study. Another interesting direction is to apply Batch Normalization to the current model.

Acknowledgements

The experiments were conducted using Theano, Blocks and Fuel. The authors would like to acknowledge the funding support from Samsung, NSERC, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR. The authors would like to thank Dmitriy Serdyuk, Dzmitry Bahdanau, Arnaud Bergeron, and Pascal Lamblin for their helpful comments.