Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

Awni Hannun, Ann Lee, Qiantong Xu, Ronan Collobert

Introduction

Sequence-to-sequence models with attention have been used for speech recognition since their inception in machine translation. These models have yielded state-of-the-art results in some settings; however, approaches such as CRF-style end-to-end models and more traditional HMM-based models are often superior.

While sequence-to-sequence models sometimes generalize well in speech recognition, they often come at a substantial cost in efficiency. The encoder typically consists of several layers of large bidirectional LSTMs. The decoder also uses a number of inefficient and sequential techniques. Efficiency is useful for fast training and evaluation times and is critical to the massive scale used in the semi-supervised and weakly supervised regimes.

In this work we develop a highly efficient sequence-to-sequence model which gives state-of-the-art results for non-speaker-adapted models on both LibriSpeech test sets. Key to our approach is a fully convolutional encoder with a time-depth separable (TDS) block structure. Our TDS convolution improves WER over an RNN baseline and, due to the parallel nature of the computation, is much more efficient. We also discard slow and sequential techniques previously thought to be important to the accuracy of these models. These include neural content attention, location-based attention, and scheduled sampling. In turn, we give more efficient alternatives.

Also key to our approach is a highly efficient and stable beam search inference procedure. Unlike previous work, accuracy does not degrade with very large beam sizes. This enables us to better leverage the constraint of a convolutional language model, which gives substantial improvements in WER over a simple n-gram baseline.

Model

We consider an input utterance $X=[X_{1},\ldots,X_{T}]$ and an output transcription $Y=[y_{1},\ldots,y_{U}]$. The sequence-to-sequence model encodes $X$ into a hidden representation and then decodes the hidden representation into a sequence of predictions for each output token. The encoder is given by

$$\begin{bmatrix}K\\ V\end{bmatrix}=\text{encode}(X) \tag{1}$$

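where $K=[K_{1},\ldots,K_{T}]$ are the keys and $V=[V_{1},\ldots,V_{T}]$ are the values. The decoder is given by

$$Q_{u}=g(y_{u-1},Q_{u-1}) \tag{2}$$

$$S_{u}=\text{attend}(Q_{u},K,V) \tag{3}$$

$$P(y_{u}\mid X,y_{<u})=h(S_{u},Q_{u}) \tag{4}$$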

Here $g(\cdot)$ is an RNN which encodes the previous token and query vector $Q_{u-1}$ to produce the next query vector. The attention mechanism $\text{attend}(\cdot)$ produces a summary vector $S_{u}$ , and $h(\cdot)$ computes a distribution over the output tokens.

Our proposed time-depth separable (TDS) convolution block (see Figure 1) partially decouples the aggregation over time from the mixing over channels. This allows us to increase the receptive field of the model with a negligible increase in the number of parameters. In preliminary experiments we find that the TDS convolution block generalizes much better than other deep convolutional architectures and needs fewer parameters. Another benefit of our block structure is that it can be implemented efficiently using a standard 2D convolution.

The block starts with a layer of 2D convolution which operates over an input of shape $T\times w\times c$ and produces an output of shape $T\times w\times c$, where $T$ is the number of time-steps, $w$ is the input width and $c$ is the number of input (and output) channels. The kernels are of size $k\times 1$. The total number of parameters in this layer is $kc^{2}$, which can be made small by keeping $c$ small. We follow the convolution with a ReLU non-linearity.
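To make the parameter count concrete, the following sketch compares the $kc^{2}$ weights of the $k\times 1$ convolution with an ordinary 1D convolution that mixes all $wc$ features at every time-step. The function names are illustrative, biases are ignored, and the width $w=80$ is an assumption taken from the 80-dimensional input features; the kernel size and channel counts are those reported in the experiments.

```python
def tds_conv_params(k: int, c: int) -> int:
    """Weights of the k x 1 2D convolution with c input and c output channels."""
    return k * c * c


def dense_conv1d_params(k: int, w: int, c: int) -> int:
    """Weights of a 1D convolution that treats all w * c features as channels."""
    return k * (w * c) ** 2


if __name__ == "__main__":
    k, w = 21, 80  # w = 80 is an assumed width, used only for illustration
    for c in (10, 14, 18):
        print(c, tds_conv_params(k, c), dense_conv1d_params(k, w, c))
```

Under these assumptions the time convolution needs a few thousand weights per block, whereas the dense alternative would need millions, which is why the receptive field can grow at little cost.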

We then view the output of the convolution as $T\times 1\times wc$ and apply a fully-connected layer, which is a sequence of two $1\times 1$ convolutions (i.e. linear layers) with a ReLU non-linearity in between. We add residual connections and layer normalization after the convolution and the fully connected layer. The layer normalization is over all dimensions for a given example including time.

The TDS architecture has three sub-sampling layers, each with a stride of 2, for a total sub-sampling factor of 8. We also increase the number of output channels at each sub-sampling layer since we compress the information in time. For simplicity these layers do not have residual connections and are only followed by a ReLU and layer normalization.

2.2 Efficient Decoder

The decoder is sequential in nature, since computing the next output requires the previous prediction. However, at training time we use teacher forcing: the previous ground-truth token is used in place of the previous prediction. In principle, this allows us to compute all output frames simultaneously. The outputs of the RNN given by $g(\cdot)$ cannot be computed in parallel; however, unrolling the computation and making a single call to an efficient CuDNN implementation is much faster than calling $U$ separate kernels. After the following optimizations, the decoder accounts for less than 10% of the total iteration time.

Techniques such as scheduled sampling, input feeding and location-based attention introduce a sequential dependency in the decoder. We discard these techniques in favor of approaches which can be computed in parallel. We simply do not use input feeding and location-based attention, as we find that we can achieve good WERs without them. We replace scheduled sampling with random sampling (Section 2.2.1).

We use an inner-product key-value attention which can be implemented much more efficiently than a neural attention. For a single example the attention is given by

$$S=V\cdot\text{softmax}\left(\frac{1}{\sqrt{d}}K^{\top}Q\right) \tag{5}$$

We scale the inner products by the inverse square root of their hidden dimension $d$. This improves convergence and helps the model learn an alignment. However, we do not see a consistent improvement in generalization.
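As an illustration, a scaled inner-product key-value attention for a single example can be sketched in plain Python as follows. The function names and the list-of-vectors layout are assumptions made for readability; a practical implementation would use batched matrix products.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(queries, keys, values):
    """Scaled inner-product attention.

    queries: U vectors of size d; keys and values: T vectors of size d.
    Returns the U summary vectors and the U attention distributions.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)  # inverse square root of the hidden dimension
    summaries, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        a = softmax(scores)
        weights.append(a)
        summaries.append(
            [sum(a_t * v[j] for a_t, v in zip(a, values))
             for j in range(len(values[0]))]
        )
    return summaries, weights
```

No step depends on a previous attention output, so all $U$ queries can be processed at once when teacher forcing is used.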

Scheduled sampling limits exposure bias by bringing the training conditions closer to the testing conditions. However, it introduces a sequential dependency in the decoder, since it sometimes uses the previous prediction at the next time-step.

Instead, we propose random sampling, where the previous prediction is replaced with a randomly sampled token. First we decide with probability $P_{\text{rs}}$ whether to sample a given input token. If we sample, we choose a new token from a uniform distribution. This allows us to vectorize the implementation as follows:

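1. Sample $U$ random numbers $c_{j}$ uniformly from $[0,1]$.

2. Set $R=[R_{1},\ldots,R_{U}]$ where $R_{j}=1$ if $c_{j}<P_{\text{rs}}$ and $R_{j}=0$ otherwise.

3. Sample a vector $Z$ of $U$ tokens. We use a uniform distribution over the output tokens not including end-of-sentence (EOS).

4. Construct $\hat{Y}=R\circ Z+(1-R)\circ Y$.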
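The procedure can be sketched as below. The function name and arguments are hypothetical, tokens are represented by integer ids, and the mask $R$ is assumed to be $1$ wherever $c_{j}<P_{\text{rs}}$. The sketch uses list comprehensions for clarity; with an array library each step becomes a single vectorized operation.

```python
import random


def random_sample_targets(y, vocab_size, eos_id, p_rs, rng):
    """Return Y_hat = R * Z + (1 - R) * Y for a list of token ids y."""
    non_eos = [t for t in range(vocab_size) if t != eos_id]
    c = [rng.random() for _ in y]               # U uniform random numbers
    r = [1 if cj < p_rs else 0 for cj in c]     # assumed mask: 1 means "replace"
    z = [rng.choice(non_eos) for _ in y]        # uniform over non-EOS tokens
    return [rj * zj + (1 - rj) * yj for rj, zj, yj in zip(r, z, y)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(random_sample_targets([5, 3, 7, 2], vocab_size=10, eos_id=9,
                                p_rs=0.01, rng=rng))
```

Because the replacement tokens do not depend on the model's own predictions, the decoder inputs for all $U$ steps are known before the forward pass.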

As we show later, random sampling improves WER.

2.3 Soft Window Pre-training

We propose a simple soft attention window pre-training scheme to enable the training of very deep convolutional encoders. Compared to prior work, our approach is simple to implement, results in negligible additional computational expense, and needs very little tuning.

We encourage the model to align the output at uniform intervals along the input by penalizing attention values which are too far from the desired locations. Let $W$ be a $T\times U$ matrix with entries $W_{ij}=(i-\frac{T}{U}j)^{2}$. The matrix $W$ encodes the (squared) distance between the $i$-th input and the $j$-th output assuming the outputs are spaced at uniform intervals along the input, hence the scaling factor $T/U$. We apply $W$ to the attention as follows

$$S=V\cdot\text{softmax}\left(\frac{1}{\sqrt{d}}K^{\top}Q-\frac{1}{2\sigma^{2}}W\right) \tag{6}$$

The term $\sigma$ is a hyper-parameter which dampens the effect of $W$ . The application of $W$ is equivalent to multiplying the normalized attention vector (i.e. after the softmax) by a Gaussian shaped mask. In that respect, $\sigma$ is simply the standard deviation of the Gaussian.
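The equivalence with a Gaussian mask can be checked numerically. The sketch below assumes the penalty enters the softmax as $-W/(2\sigma^{2})$ and that the masked attention is renormalized to sum to one; indices start at zero and all function names are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def window_column(T, U, j):
    """Squared distances W_ij for output j and inputs i = 0..T-1."""
    return [(i - (T / U) * j) ** 2 for i in range(T)]


def windowed_attention(scores, w_col, sigma):
    """Softmax of the scores with the window penalty subtracted beforehand."""
    return softmax([s - w / (2 * sigma ** 2) for s, w in zip(scores, w_col)])


def gaussian_masked_attention(scores, w_col, sigma):
    """Softmax first, then a Gaussian mask, then renormalization."""
    masked = [a * math.exp(-w / (2 * sigma ** 2))
              for a, w in zip(softmax(scores), w_col)]
    total = sum(masked)
    return [m / total for m in masked]
```

Both functions return the same distribution because $\exp(s-W/(2\sigma^{2}))=\exp(s)\exp(-W/(2\sigma^{2}))$, and the second factor is an unnormalized Gaussian with standard deviation $\sigma$ centered on the desired location.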

We use the window pre-training for the first few epochs and then switch it off. This is sufficient to enable the model to learn an alignment and converge. In general $\sigma$ does not need to be tuned when model hyper-parameters change. An exception is when the amount of sub-sampling in the encoder changes, in which case $\sigma$ should change accordingly.

2.4 Regularization

We use three additional forms of regularization to control overfitting and improve the generalization of the model.

First we apply dropout after each layer in each block of the encoder. We apply dropout after the non-linearity and prior to layer normalization. We do not use any dropout in the decoder.

2.4.2 Label Smoothing

We use label smoothing to reduce over-confidence in predictions. As in machine translation, we find that label smoothing hurts loss on the dev set but improves WER.
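The text does not state the exact form of label smoothing. A common variant, assumed here purely for illustration, mixes the one-hot target with a uniform distribution over the vocabulary; the function names and the default of 5% smoothing (the value used in the experiments) are the only specifics.

```python
def smoothed_targets(label, vocab_size, eps=0.05):
    """Target distribution (1 - eps) * one_hot(label) + eps * uniform (assumed form)."""
    base = eps / vocab_size
    return [base + (1.0 - eps) if t == label else base
            for t in range(vocab_size)]


def cross_entropy(target_dist, log_probs):
    """Cross-entropy between a target distribution and model log-probabilities."""
    return -sum(p * lp for p, lp in zip(target_dist, log_probs))
```

Since some target mass is always placed on incorrect tokens, the loss cannot reach zero, which discourages over-confident predictions.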

2.4.3 Word Piece Sampling

We use word pieces as outputs following the Unigram Language Model approach. During training, we sample word piece representations for a given transcription, but unlike prior work, we sample at the word level instead of the sentence level. For each word, with probability $1-P_{\text{wp}}$ we take the most likely word piece representation, and with probability $P_{\text{wp}}$ we sample uniformly from the top ten most likely alternatives.
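Word-level sampling can be sketched as follows. The `segmentations` mapping is a hypothetical stand-in for the ranked candidate segmentations that a word piece model would provide, and the sketch assumes the top ten includes the most likely segmentation.

```python
import random


def sample_word_pieces(words, segmentations, p_wp, rng):
    """Choose a word piece segmentation independently for each word.

    segmentations[word] lists candidate segmentations, most likely first.
    """
    pieces = []
    for word in words:
        candidates = segmentations[word]
        if rng.random() < p_wp:
            choice = rng.choice(candidates[:10])  # uniform over the top ten
        else:
            choice = candidates[0]                # most likely segmentation
        pieces.extend(choice)
    return pieces


if __name__ == "__main__":
    segs = {
        "hello": [["_hello"], ["_hel", "lo"], ["_he", "llo"]],
        "world": [["_world"], ["_wor", "ld"]],
    }
    print(sample_word_pieces(["hello", "world"], segs, 0.01, random.Random(0)))
```

Sampling per word keeps most of each transcription in its most likely form while still exposing the model to alternative segmentations.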

Beam Search Decoding

We use an open-vocabulary beam search decoder which optimizes the following objective

$$\log P_{\text{s2s}}(Y\mid X)+\alpha\log P_{\text{lm}}(Y)+\beta|Y| \tag{7}$$

The term $|Y|$ counts the number of tokens in $Y$ . In the above, $\alpha$ is the LM weight and $\beta$ is the token insertion term.
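A hypothesis score combining these terms can be sketched as below, assuming the objective is the sequence-to-sequence log-probability plus the weighted LM log-probability plus the token insertion term. The function name and the example values are illustrative only.

```python
def hypothesis_score(log_p_s2s, log_p_lm, num_tokens, alpha, beta):
    """Acoustic model score + alpha * LM score + beta * |Y| (assumed form)."""
    return log_p_s2s + alpha * log_p_lm + beta * num_tokens


if __name__ == "__main__":
    # A longer hypothesis with a lower model score can win once beta > 0.
    short = hypothesis_score(-4.0, -6.0, 3, alpha=0.5, beta=1.0)
    longer = hypothesis_score(-5.0, -7.0, 5, alpha=0.5, beta=1.0)
    print(short, longer)
```

A positive $\beta$ offsets the tendency of summed log-probabilities to favor shorter transcriptions.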

Sequence-to-sequence beam search decoders are known to be unstable, sometimes exhibiting worse performance with an increasing beam size. We use two techniques to stabilize the beam search. This allows our model to extract more value from the integration of an LM, since we can use a large beam size to effectively search over the space of possible hypotheses.

We do not allow the beam search to propose any hypotheses which attend more than $t_{\text{max}}$ frames away from the previous attention peak. In practice we find that $t_{\text{max}}$ only needs to be tuned once for a given data set and can otherwise remain unchanged.
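One way to express this constraint in code is shown below. The sketch assumes the limit is checked on the position of the attention maximum for the proposed step; the function names are hypothetical.

```python
def attention_peak(attention):
    """Index of the encoder frame with the largest attention weight."""
    return max(range(len(attention)), key=lambda t: attention[t])


def within_attention_limit(attention, prev_peak, t_max):
    """True if the new peak is at most t_max frames from the previous peak."""
    return abs(attention_peak(attention) - prev_peak) <= t_max
```

Hypotheses for which the check fails are simply not proposed, which rules out the large jumps in alignment associated with looping and skipping.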

3.1.2 End-of-sentence Threshold

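In order to bias the search away from short transcriptions, we only consider end-of-sentence (EOS) proposals when the score is greater than a specified factor of the best candidate score

$$\log P_{u}(\text{EOS}\mid y_{<u})>\gamma\cdot\max_{c\neq\text{EOS}}\log P_{u}(c\mid y_{<u}) \tag{8}$$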

Like the hard attention limit, we find the parameter $\gamma$ only needs to be tuned once for a given data set.
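The EOS check can be sketched as follows, assuming the scores are log-probabilities and the best candidate is taken over the non-EOS tokens. The function name is illustrative.

```python
def eos_allowed(log_probs, eos_id, gamma):
    """True if the EOS log-probability exceeds gamma times the best other score."""
    best_other = max(lp for t, lp in enumerate(log_probs) if t != eos_id)
    return log_probs[eos_id] > gamma * best_other
```

Because log-probabilities are negative, a factor $\gamma>1$ makes the right-hand side more negative, so EOS is admitted whenever it is reasonably competitive with the best alternative and rejected when it is far less likely.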

3.2 Efficiency

We use a few heuristics to further improve the efficiency of the beam search. First, we set a beam threshold to prune hypotheses in the beam which are below a fixed range from the best hypothesis so far.

We also apply a threshold when proposing new candidate tokens to the current set of hypotheses in the beam. Similar to Equation 8, we require that the proposed token score satisfy

$$\log P_{u}(y\mid y_{<u})>\max_{c}\log P_{u}(c\mid y_{<u})-\eta \tag{9}$$

Finally, we batch compute the updated set of probabilities for every candidate in the beam, so only one forward pass is required at each step. These techniques result in a fast decoding time even with a deep convolutional LM and a large beam.
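The two pruning heuristics can be sketched as below. The function names are hypothetical, and the candidate rule is assumed to keep tokens whose log-probability lies within $\eta$ of the best token.

```python
def prune_beam(hyps, beam_threshold, beam_size):
    """Keep hypotheses within beam_threshold of the best, at most beam_size.

    hyps is a list of (score, tokens) pairs.
    """
    best = max(score for score, _ in hyps)
    kept = [h for h in hyps if h[0] >= best - beam_threshold]
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:beam_size]


def candidate_tokens(log_probs, eta):
    """Token ids whose log-probability is within eta of the best token."""
    best = max(log_probs)
    return [t for t, lp in enumerate(log_probs) if lp > best - eta]
```

Both filters discard low-scoring options before they are expanded, so the cost of each step depends on the number of plausible continuations rather than on the full vocabulary times the beam size.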

Experiments

We perform experiments on the full 960-hour LibriSpeech corpus. Our best encoder has two 10-channel, three 14-channel and six 18-channel TDS blocks. We use three 1D convolutions to sub-sample over time, one as the first layer and one in between each group of TDS blocks. Kernel sizes are all $21\times 1$. A final linear layer produces the 1024-dimensional encoder output. The decoder is a one-layer GRU with 512 hidden units. Weights are initialized from a uniform distribution $\mathcal{U}(-\sqrt{4/f_{in}},\sqrt{4/f_{in}})$, where $f_{in}$ is the fan-in to each unit.
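The initialization can be sketched for a single weight matrix as follows. The function name and the row-per-output-unit layout are assumptions for illustration.

```python
import math
import random


def init_weights(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from U(-sqrt(4/fan_in), sqrt(4/fan_in))."""
    bound = math.sqrt(4.0 / fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]
```

For example, a unit with a fan-in of 1024 draws its weights from $\mathcal{U}(-0.0625, 0.0625)$.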

Input features are 80-dimensional mel-scale filter banks computed every 10 ms with a 25 ms window. We use 10k word pieces computed from the SentencePiece toolkit as the output token set. All models are trained on 8 V100 GPUs with a batch size of 16 per GPU. We use synchronous SGD with a learning rate of 0.05, decayed by a factor of 0.5 every 40 epochs. We clip the gradient norm to 15. The model is pre-trained for three epochs with the soft window and $\sigma=4$. We use 20% dropout, 5% label smoothing, 1% random sampling and 1% word piece sampling.
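The step decay of the learning rate can be written as a small function. Epochs are assumed to be counted from zero, and the function name is illustrative.

```python
def learning_rate(epoch, base_lr=0.05, decay=0.5, step=40):
    """Learning rate after halving once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


if __name__ == "__main__":
    for epoch in (0, 39, 40, 80, 120):
        print(epoch, learning_rate(epoch))
```

With 8 GPUs and 16 examples per GPU, each synchronous update uses an effective batch of 128 utterances.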

We train two word piece LMs on the 800M-word text-only data set. The first is a 4-gram trained with KenLM and the second is a convolutional LM (ConvLM) using the same model architecture and training strategy as prior work. We use a beam size of 80 and set $t_{\text{max}}=30$, the EOS penalty $\gamma=1.5$ and $\eta=10$. The LM weight and token insertion terms are cross-validated with each dev set and LM combination. We use the wav2letter++ framework to train and evaluate our models.

Table 1 compares the TDS model with three other systems. The CAPIO system is a hybrid HMM-DNN with speaker adaptation. The other two are end-to-end models, one using the CRF-style ASG loss and the other a sequence-to-sequence model with an RNN encoder.

Our proposed model achieves state-of-the-art results for end-to-end systems, with 3.28 WER on test clean and 9.84 WER on test other. Compared with the RNN-based encoder, the TDS model improves WER by 14.1% on test clean and 22.9% on test other with nearly a factor of 4 reduction in parameters (136M vs. 37M). The TDS model benefits more from an external LM. This could be due to (1) a better loss on the correct transcription and (2) a more effective beam search.

4.2 Model Variations

Table 2 shows results from varying the number of TDS blocks, the number of parameters, the word piece sampling probability and the amount of random sampling. For each setting we train three models and report the best and the average WER.

We reduce the number of parameters without changing the receptive field by reducing the number of channels in each group of TDS blocks from (10, 14, 18) to (10, 12, 14) or (10, 10, 10). The model is very sensitive to decreasing the number of parameters. We also examine the effect of varying the number of TDS blocks without changing the number of parameters or the receptive field. For 9 TDS blocks we use (14, 16, 20) channels with $k=27$, and for 12 TDS blocks we use (10, 16, 16) channels with $k=19$. We show that a small amount of word piece sampling is helpful. With a higher $P_{\text{wp}}$ the model sometimes converges poorly, likely due to the variability in the targets. A small amount of random sampling is also helpful. Finally, when we remove soft window pre-training, the model takes much longer to converge and achieves a worse result. The soft window clearly helps guide the attention early in training.

Figure 2 shows the effect of the receptive field on WER. There is a sharp increase in WER when the size of the receptive field drops below a threshold. Qualitative analysis shows that the high WER is often due to catastrophic errors such as looping and skipping, a common problem for sequence-to-sequence models. We hypothesize that without a large receptive field, the encoder keys do not have enough context to disambiguate queries from the decoder.

Figure 3 shows how WER changes with the size of the beam. While most of the gain from including an external LM comes even at a small beam size, we see consistent improvements up to a beam size of 80, particularly on dev other.

4.3 Efficiency

We compare the TDS conv model to a strong RNN baseline in terms of training efficiency on LibriSpeech. The RNN baseline encoder consists of six bidirectional LSTMs. Both models have a total sub-sampling factor of 8. Our best TDS architecture can complete one full epoch over the LibriSpeech training set in 7 minutes. This is more than $10\times$ faster than our implementation of the RNN baseline and more than $4\times$ faster than the RNN baseline encoder but with the efficient decoder described in Section 2.2.

Our beam search runs at an average rate of 0.57 and 0.93 seconds-per-sample on dev clean and other with the 4-gram LM and a beam size of 80. With the ConvLM, times increase to 0.73 and 1.20 seconds-per-sample at the same beam size.

Related Work

Our work builds on a large body of work aimed at improving sequence-to-sequence models with attention for both speech recognition and other application domains. Fully convolutional encoders have worked well in machine translation. They have also given state-of-the-art results in speech recognition with more structured loss functions like the AutoSegCriterion. However, we are not aware of any competitive results with fully convolutional encoders in sequence-to-sequence models for speech recognition.

The high-level encoder architecture is similar to the Transformer model; however, we consider convolutions instead of self-attention. Our architecture is inspired by and quite related to the lightweight convolution. An important idea of that work and ours is the separation of the integration over time from the mixing over channels, which improves both accuracy and efficiency. Other than the application to speech, some differences in our encoder architecture are (1) the time-depth separable convolution can be implemented with a simple 2D convolution and (2) our models do not use any normalization over the time dimension of the kernels.

Depth-wise separable convolutions have been used to improve the efficiency and accuracy of computer vision models. The first layer of the TDS block can be seen as a grouped 1D convolution with $cw$ channels, a group size of $c$, and weights tied between groups. Grouped convolutions have also been used in computer vision to improve efficiency (e.g., for model-parallel training) and classification accuracy.

Conclusion

We have shown that a fully convolutional encoder and a simple decoder can give superior results to a strong RNN baseline while being an order of magnitude more efficient. Key to the success of the convolutional encoder is a time-depth separable block structure which allows the model to retain a large receptive field. We also show how to integrate a strong convolutional LM with a stable and scalable beam search procedure.

Acknowledgements

Thanks to Michael Auli, Abdelrahman Mohamed, Tatiana Likhomanenko and Gabriel Synnaeve for helpful conversations.