End-to-End Speech Recognition From the Raw Waveform

Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, Emmanuel Dupoux

Introduction

State-of-the-art speech recognition systems are rapidly shifting from the paradigm of composite subsystems trained or designed independently to the paradigm of end-to-end training. Most of the work in this direction has been devoted to learning the acoustic model directly from sequences of phonemes or characters, without an intermediate alignment step or phone-state/senone induction. The other end of the pipeline, namely learning directly from the waveform rather than from speech features such as mel-filterbanks or MFCC, has recently received attention, but performance on the main task of speech recognition still seems to lag behind that of models trained on speech features.

Yet, promising results have already been obtained by learning the front-end of speech recognition systems. We focus the discussion on trainable components that can be plugged in as a replacement for mel-filterbanks without modifying the acoustic model. The approach of Hoshen et al. and Sainath et al., inspired by gammatone filterbanks, achieved similar or better results than comparable mel-filterbanks on multichannel speech recognition and in far-field/noisy recording conditions. More recently, Zeghidour et al. proposed an alternative learnable front-end: a convolutional architecture that computes a scattering transform and can be initialized as an approximation of mel-filterbanks. It obtained promising results on end-to-end phone recognition on TIMIT. However, these approaches have not been shown to improve on speech features for large-scale, end-to-end speech recognition in clean recording conditions on English, admittedly one of the tasks for which mel-filterbanks have been the most extensively tuned.

We present a systematic comparison of the two previous architectures of learnable filterbanks, which we will (coarsely) refer to as gammatone-based and scattering-based, and evaluate them against mel-filterbanks within an end-to-end training pipeline on letter error rate and word error rate on the Wall Street Journal dataset. Our main contributions and results are the following:

A mean-variance normalization layer on top of the log nonlinearity of learnable filterbanks appears to be critical for the efficient learning of the gammatone-based architecture, and makes the training of the scattering-based architecture faster;

The low-pass filter previously used in the scattering-based learnable filterbanks stabilizes the training of gammatone filterbanks, compared to the max-pooling that was originally proposed;

For scattering-based trainable filterbanks, keeping the low-pass filter fixed during training makes it possible to learn the filters efficiently from a random initialization, whereas previously reported results with random initialization of both the filters and the low-pass filter showed poor performance compared to a suitable initialization;

Both trainable filterbanks improve on the mel-filterbanks baseline in word error rate on the Wall Street Journal dataset, under similar conditions (same number of filters, same end-to-end convolutional architecture). This is the first time learnable filterbanks improve on a strong mel-filterbanks baseline on a large-vocabulary speech recognition task under clean recording conditions.

The next section describes the learnable filterbanks architectures. Then, we present the end-to-end convolutional architecture used to perform the comparisons, and analyze the results of our comparative studies.

Learning filterbanks from raw speech

The two approaches that we consider for learning filterbanks from the raw waveform can be used as direct replacement for mel-filterbanks in any end-to-end learning pipeline: they are convolutional architectures that take the raw waveform as input and output $40$ channels every $10$ ms. As such, they can directly be compared with standard mel-filterbanks, simply by changing the features stage of a neural-network-based acoustic model. The filters are then nothing more than an additional layer to the neural network and are learnt by backpropagation with the rest of the acoustic model.

The first architecture we consider is inspired by Hoshen et al. and Sainath et al.; the second one is taken from Zeghidour et al. They are described in Table 1.

In both architectures, a convolutional layer with window length $25$ ms (to match the standard frame size used in mel-filterbanks) is applied with a stride of $1$ sample, and is followed by a nonlinearity to give $40$ output channels for each sample. Then, a pooling operator of width $25$ ms with a stride of $10$ ms performs low-pass filtering and decimation. Finally, a log non-linearity reproduces the dynamic range compression of log mel-filterbanks. The parameters to be learnt are the convolution filters, and possibly the weights of the low-pass filters.

The two architectures differ in the choices made for each layer of computation. Hoshen et al. and Sainath et al. use $40$ real-valued filters with a ReLU non-linearity, and rely on gammatones as filter values to approximate mel-filterbanks. In their work, they use a max-pooling operator for low-pass filtering. In contrast, Zeghidour et al. use $40$ complex-valued filters with a square modulus operator as non-linearity. Low-pass filtering is then performed by multiplying each output channel by a squared Hanning window so that, when using suitable Gabor wavelets as convolution filters, the architecture closely approximates mel-filterbanks computed on the power spectrum.
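The two pipelines described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes a 16 kHz sampling rate (so that 25 ms and 10 ms correspond to 400 and 160 samples), uses random filters in place of gammatone or Gabor initializations, applies no padding, and adds a small constant `EPS` before the log. All function and variable names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SAMPLE_RATE = 16_000                 # assumed sampling rate, not stated in the text
WIDTH = 25 * SAMPLE_RATE // 1000     # 25 ms window -> 400 samples
STRIDE = 10 * SAMPLE_RATE // 1000    # 10 ms decimation -> 160 samples
N_FILTERS = 40
EPS = 1e-6                           # assumed floor that keeps the log finite


def convolve_stride1(wave, filters):
    """Apply every filter at every sample position (no padding).

    wave: (n_samples,), filters: (n_filters, width).
    Returns an array of shape (n_samples - width + 1, n_filters).
    """
    windows = sliding_window_view(wave, filters.shape[1])
    return windows @ filters.T


def pool_windows(channels, width, stride):
    """Cut (time, n_filters) into windows of shape (n_frames, n_filters, width)."""
    return sliding_window_view(channels, width, axis=0)[::stride]


def gammatone_style(wave, real_filters):
    """Real filters -> ReLU -> max-pooling -> log."""
    rectified = np.maximum(convolve_stride1(wave, real_filters), 0.0)
    pooled = pool_windows(rectified, WIDTH, STRIDE).max(axis=-1)
    return np.log(pooled + EPS)


def scattering_style(wave, complex_filters, lowpass):
    """Complex filters -> squared modulus -> weighted low-pass -> log."""
    energy = np.abs(convolve_stride1(wave, complex_filters)) ** 2
    pooled = pool_windows(energy, len(lowpass), STRIDE) @ lowpass
    return np.log(pooled + EPS)


rng = np.random.default_rng(0)
wave = rng.standard_normal(SAMPLE_RATE // 5)             # 0.2 s of noise
real_filters = rng.standard_normal((N_FILTERS, WIDTH))   # random stand-in filters
complex_filters = real_filters + 1j * rng.standard_normal((N_FILTERS, WIDTH))
hanning_squared = np.hanning(WIDTH) ** 2                 # fixed low-pass weights

gamm_out = gammatone_style(wave, real_filters)
scatt_out = scattering_style(wave, complex_filters, hanning_squared)
print(gamm_out.shape, scatt_out.shape)  # one row per 10 ms frame, 40 channels
```

In a trainable version, `real_filters`, `complex_filters` and optionally `hanning_squared` would be parameters updated by backpropagation together with the acoustic model.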

The number of filters ($40$), the convolution and pooling width of $25$ ms, as well as the decimation of $10$ ms are not necessarily the optimal parameters of either trainable architecture, but these are the standard settings of mel-filterbanks (and likely the best settings for these features on standard speech recognition datasets). We keep these values fixed for the trainable architectures, so that the comparison to mel-filterbanks is carried out in the setting most favorable to the non-learnable baseline.

In the next subsections, we describe the improvements we propose for these architectures: the low-pass filter and the addition of instance normalization.

Low-pass filter

The original papers describing the gammatone-based trainable filterbanks used max-pooling as the low-pass filter, whereas the scattering-based approach uses a squared Hanning window per channel. To make sure the low-pass filter is not responsible for notable differences between the two approaches, we experiment with the squared Hanning window on both architectures. For both architectures, we also propose to keep this low-pass filter fixed while learning the convolution filter weights, a setting that was not explored by Zeghidour et al., who learnt the low-pass filter weights when randomly initializing the convolutions.

Instance normalization

More importantly, we noticed that a per-channel, per-sentence mean-variance normalization after log-compression is important for the baseline mel-filterbanks. Consequently, we propose to add a mean-variance normalization layer to both trainable architectures, performed for each of the $40$ channels independently on each sentence. Coincidentally, this corresponds to an instance normalization layer, which has been shown to stabilize training in other deep learning contexts.
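The per-channel, per-sentence normalization described above can be illustrated as follows. This is a minimal sketch assuming the features of one sentence are stored as a `(n_frames, n_channels)` array; the function name and the stabilizing constant `eps` are assumptions, not taken from the paper.

```python
import numpy as np


def instance_norm(features, eps=1e-5):
    """Mean-variance normalize each channel over the frames of one sentence.

    features: array of shape (n_frames, n_channels), e.g. log filterbank outputs.
    eps: assumed small constant that avoids division by zero.
    """
    mean = features.mean(axis=0, keepdims=True)   # one mean per channel
    std = features.std(axis=0, keepdims=True)     # one std per channel
    return (features - mean) / (std + eps)


rng = np.random.default_rng(0)
sentence = 3.0 + 2.0 * rng.standard_normal((500, 40))   # 500 frames, 40 channels
normalized = instance_norm(sentence)
```

Because the statistics are computed on each sentence separately, no running averages over the training set are needed, which is what distinguishes this from batch normalization.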

Experimental setup

The experiments compare different versions of the trainable architectures against log mel-filterbanks on a single deep convolutional network architecture for the acoustic model. The experiments are carried out on the open vocabulary task of the Wall Street Journal dataset, using the subset si284 for training, nov93-dev for validation, and nov92-eval for testing. Training is performed end-to-end on letters. We report both letter and word error rates. All our experiments use the open source code of wav2letter. In the next subsections, we describe the model, the different variants we tested and the hyperparameters.

Acoustic model

Taking either log mel-filterbanks or trainable filterbanks as input, the acoustic model is a convolutional network with gated linear units (GLU) trained to predict sequences of letters, following . The model is a smaller version of the convolutional network used in since they train on the larger LibriSpeech dataset. Using the syntax C-input channels-output channels-width, the architecture we use has the structure C-40-200-13/C-100-200-3/C-100-200-4/C-100-250-5/C-125-250-6/C-125-300-7/C-150-350-8/C-175-400-9/C-200-450-10/C-225-500-11/C-250-500-12/C-250-500-13/C-250-600-14/C-300-600-15/C-300-750-21/C-375-1000-1. All convolutions have stride $1$. The number of input channels of the $(n+1)$-th convolution is half the number of output channels of the $n$-th convolution because of the GLU. Each convolution layer is followed by a GLU layer with a dropout of $0.25$. There is an additional linear layer to predict the final letter probabilities. When predicting letters, the training and decoding are performed as in . When predicting words, we use a 4-gram language model trained on the standard LM data of WSJ and perform beam search decoding, as in .
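As a sanity check on the notation, the short sketch below parses the architecture string, verifies the halving of channels induced by the GLU, and derives the receptive field that follows from stride-1 convolutions. The helper name `parse_architecture` is hypothetical, and the receptive-field figure is derived here from the listed widths rather than reported in the paper.

```python
ARCHITECTURE = (
    "C-40-200-13/C-100-200-3/C-100-200-4/C-100-250-5/"
    "C-125-250-6/C-125-300-7/C-150-350-8/C-175-400-9/"
    "C-200-450-10/C-225-500-11/C-250-500-12/C-250-500-13/"
    "C-250-600-14/C-300-600-15/C-300-750-21/C-375-1000-1"
)


def parse_architecture(spec):
    """Turn 'C-in-out-width/...' into a list of (in, out, width) tuples."""
    layers = []
    for token in spec.split("/"):
        _, n_in, n_out, width = token.split("-")
        layers.append((int(n_in), int(n_out), int(width)))
    return layers


layers = parse_architecture(ARCHITECTURE)

# A GLU halves the channel count, so each layer's input must be half of
# the previous layer's output.
glu_consistent = all(
    2 * next_in == prev_out
    for (_, prev_out, _), (next_in, _, _) in zip(layers, layers[1:])
)

# With stride 1 everywhere, each layer adds (width - 1) frames of context.
receptive_field = 1 + sum(width - 1 for _, _, width in layers)

print(len(layers), glu_consistent, receptive_field)
```

Under these assumptions, each output position of the acoustic model depends on 137 consecutive input frames.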

Variants

We compare the two architectures of trainable filterbanks along different axes: how to initialize the convolutions of the trainable filterbanks, the low-pass filter, and instance normalization.

Gammatone-based architecture

Initialization of the convolution filters: random (rand), or gammatone filters (gamm) that match the impulse response of a reference open source implementation of gammatones;

Low-pass filter: max-pooling (max-pool) as in the original gammatone-based work, or the squared Hanning window (Han-fixed).

Scattering-based architecture

Initialization of the convolution filters: random (rand), or Gabor filters (scatt) as described in Section 2.2 of Zeghidour et al.;

Low-pass filter: the squared Hanning window (Han-fixed), or a low-pass filter of the same width and stride that is initialized with the weights of the squared Hanning window and then learnt by backpropagation (Han-learnt).

Hyperparameters and training

For models trained on the raw waveform, the signal was first normalized with mean/variance normalization by sequence. The network is trained with stochastic gradient descent and weight normalization for all convolutional layers except the front-ends. First, $80$ epochs are performed with a learning rate of $1.4$, then training is resumed for $80$ additional epochs with a learning rate of $0.1$. These hyperparameters were chosen from preliminary experiments as they seemed to work well for all architectures. Additional hyperparameters are the momentum and the learning rate for the training criterion, respectively chosen in $\{0,0.9\}$ and $\{0.001,0.0001\}$.

For Letter Error Rate (LER) evaluations, the hyperparameters are selected using the LER on the validation set, validating every epoch. For Word Error Rate (WER) evaluations, the hyperparameters are chosen on the validation set using the WER, validating every $10$ epochs. The model selected on LER is also included for validation. The additional hyperparameters are the weight of the language model and the weight of the word insertion penalty (see for details). We set them between $5$ and $8$ by steps of $0.5$, and between $-2$ and $0.5$ by steps of $0.1$, respectively. For hyperparameter selection, the beam size of the decoder is set to $2,500$; the final performances are computed with the selected hyperparameters but using a beam size of $25,000$.
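To give a sense of the size of the decoder search space, the grid described above can be enumerated as follows. This sketch assumes both endpoints of each range are included, which the text does not state explicitly; the variable names are hypothetical.

```python
import itertools

import numpy as np

# Language model weight: 5 to 8 by steps of 0.5 (endpoints assumed inclusive).
lm_weights = np.linspace(5.0, 8.0, 7)
# Word insertion penalty weight: -2 to 0.5 by steps of 0.1 (endpoints assumed inclusive).
insertion_penalties = np.linspace(-2.0, 0.5, 26)

# Every (language model weight, insertion penalty) pair to evaluate on validation.
decoder_grid = list(itertools.product(lm_weights, insertion_penalties))
print(len(decoder_grid))
```

Under that assumption, each validated model is decoded with 7 x 26 = 182 combinations, which helps explain why the smaller beam size is used during selection.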

Experiments

Table 4 contains our results together with end-to-end baselines from the literature. is the current state-of-the-art on the WSJ dataset; it is given as a topline but uses much more training data ($\sim 12,000$ h of speech), so the results are not comparable. are representative results in terms of WER and LER from the literature of end-to-end models trained on speech features from 2014-2017, in chronological order. and are the current state-of-the-art in LER on speech features and from the waveform, respectively. These comparisons validate our model trained on mel-filterbanks as a strong baseline in light of recent results, as it outperforms the state-of-the-art in LER by a significant margin ($4.9\%$ vs $6.1\%$ for ), and achieves a test WER of $6.6\%$, better than all other end-to-end baselines ( and report WER that are below our $6.6\%$ but are on easier closed vocabulary tasks).

As described in the subsection on instance normalization, we evaluate the integration of instance normalization after the log-compression in the trainable filterbanks, which was not used in previous work but is used in our baseline. Figure 1 shows training LER as a function of the number of epochs for scattering-based and gammatone-based filterbank models, with and without instance normalization. We can see that this normalization drastically improves the training stability of the gammatone-based model, while it moderately improves the scattering-based model. We observed a positive impact of instance normalization in all settings, and so only report as a reference the results of our implementation of vanilla gammatone-based trainable filterbanks following . Comparing gammatone (learnt)/gamm/max-pool without instance norm (under SOTA – waveform) to the results of gammatone (learnt)/gamm/max-pool in Table 4, we see a significant improvement of both LER and WER due to instance normalization, with an absolute reduction in LER and WER of $1.5\%$ and $2.8\%$ respectively.

For low-pass filtering, we first compare the Han-fixed setting to max-pooling for gammatone-based filterbanks (as max-pooling was previously used by Hoshen et al. and Sainath et al.), and to Han-learnt for scattering, all with instance normalization. The tendency is that the Han-fixed setting consistently improves the results in LER and WER of both trainable filterbanks. More importantly, using either a Han-fixed or a Han-learnt filter when learning scattering-based filterbanks from a random initialization removes the gap in performance with the Gabor wavelet initialization that was observed by Zeghidour et al., where the low-pass filter was also initialized randomly. This is an important result, since carefully initializing the convolutional filters is technically non-trivial and also relies on prior knowledge of mel-filterbanks. We believe the ability to use random initialization is an important first step towards more extensive tuning of trainable filterbanks (e.g., trying different numbers of filters, decimation or convolution width).

Compared to the literature, replacing the max-pooling by a low-pass filter and adding an instance normalization layer leads to a $23\%$ relative improvement in LER and a $33\%$ relative improvement in WER on nov92-eval on the gammatone-based trainable filterbanks, a significant improvement compared to the existing approach . Our models trained on the waveform also exhibit a gain in performance in LER of $22$-$31\%$ relative compared to the state-of-the-art end-to-end model trained on the waveform with its first 6 layers being pre-trained for mel-filterbanks reconstruction , and outperform various end-to-end models trained on speech features, both in LER and WER .

Comparing both trainable filterbanks with instance normalization to the log mel-filterbanks baseline, we observe that the performances of the Han-fixed settings and of the mel-filterbanks are comparable in terms of LER. However, we observe a consistent improvement in terms of WER for all trainable filterbanks. To the best of our knowledge, this is the first time a significant improvement in terms of WER relative to comparable mel-filterbanks has been shown on a large vocabulary task under clean recording conditions. Some improvements on the clean test of the Switchboard dataset have previously been observed by , but their comparison point is MFCC rather than mel-filterbanks and the number of filters of the trainable architecture differs from their MFCC baseline.

The first step in the computation of mel-filterbanks is typically the application of a pre-emphasis layer to the raw signal. Pre-emphasis is a convolution with a first-order high-pass filter of the form $y[n]=x[n]-\alpha x[n-1]$, with $\alpha$ typically equal to $0.97$. This operation can be performed by a convolutional layer of kernel size $2$ and stride $1$ that can be plugged below time-domain filterbanks, initialized with weights $[-0.97\quad 1]$, then learned with the network. In Table 4, we compare the performance of identical models (all using a fixed Hanning window, and a gammatone or scattering initialization) with and without pre-emphasis. We observe a gain on both LER and WER (except on nov93-dev WER/scatt) when using pre-emphasis.
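The equivalence between the pre-emphasis formula and a size-2 convolution can be checked numerically. The sketch below uses `np.correlate`, since convolutional layers in deep learning toolkits compute a cross-correlation (no kernel flip); the function name `pre_emphasis` is hypothetical and no padding is applied, so the output is one sample shorter than the input.

```python
import numpy as np

ALPHA = 0.97
# Kernel ordered as [weight on x[n-1], weight on x[n]], matching [-0.97, 1].
kernel = np.array([-ALPHA, 1.0])


def pre_emphasis(signal, weights=kernel):
    """Size-2, stride-1 'convolution' (cross-correlation) without padding."""
    return np.correlate(signal, weights, mode="valid")


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

y_conv = pre_emphasis(x)
y_formula = x[1:] - ALPHA * x[:-1]   # y[n] = x[n] - alpha * x[n-1] for n >= 1
print(np.allclose(y_conv, y_formula))
```

In a trainable setting, `kernel` would simply be the initial value of the layer weights, which are then updated with the rest of the network.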

This paper presents a systematic study of two approaches for trainable filterbanks, which clarifies good practices and identifies better architectures to learn from raw speech. First, our results show that adding an instance normalization layer on top of the trainable filterbanks is critical for learning gammatone-based architectures, and speeds up learning of scattering-based architectures. Second, the use of a fixed squared Hanning window as low-pass filter is critical to learn the scattering-based filterbanks from a random initialization of the filters, and improves on max-pooling for gammatone-based filterbanks. With these two improvements, we observe a consistent reduction of WER against comparable mel-filterbanks on the open vocabulary task of the WSJ dataset, in the setting of speech recognition under clean recording conditions, most likely the setting for which mel-filterbanks have been the most heavily tuned.

This research was partially funded by the European Research Council (ERC-2011-AdG-295810 BOOTPHON), the Agence Nationale pour la Recherche (ANR-10-LABX-0087 IEC, ANR-10-IDEX-0001-02 PSL*).