Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition

Julian Salazar, Katrin Kirchhoff, Zhiheng Huang

Introduction

Connectionist temporal classification (CTC) has matured as a scalable, end-to-end approach to monotonic sequence transduction tasks like handwriting recognition, action labeling, and automatic speech recognition (ASR), sidestepping the label alignment procedure required by traditional hidden Markov model plus neural network (HMM-NN) approaches. However, the most successful end-to-end approach to general sequence transduction has been the encoder-decoder with attention. Though first used in machine translation, its generality makes it useful for ASR as well. Its lack of enforced monotonicity, though, makes encoder-decoder ASR models difficult to train, often necessitating thousands of hours of data, careful learning rate schedules, pretraining, or auxiliary CTC losses to approach state-of-the-art results. The decoders are also typically autoregressive at prediction time, restricting inference speed.

Both approaches have conventionally used recurrent layers to model temporal dependencies. As this hinders parallelization, later works proposed partially- or purely-convolutional CTC models and convolution-heavy encoder-decoder models for ASR. However, convolutional models must be significantly deeper to reach the same temporal receptive field. Recently, the mechanism of self-attention was proposed, which uses the whole sequence at once to model feature interactions that are arbitrarily distant in time. Its use in both encoder-decoder and feedforward contexts has led to faster training and state-of-the-art results in translation (via the Transformer), sentiment analysis, and other tasks. These successes have motivated preliminary work in self-attention for ASR. Time-restricted self-attention was used as a drop-in replacement for individual layers in the state-of-the-art lattice-free MMI model, an HMM-NN system. Hybrid self-attention/LSTM encoders were studied in the context of listen-attend-spell (LAS), and the Transformer was directly adapted to speech; both are encoder-decoder systems.

In this work, we propose and evaluate fully self-attentional networks for CTC (SAN-CTC). We are motivated by practicality: self-attention could be used as a drop-in replacement in existing CTC-like systems, where only attention has been evaluated in the past; unlike encoder-decoder systems, SAN-CTC is able to predict tokens in parallel at inference time; an analysis of SAN-CTC is useful for future state-of-the-art ASR systems, which may equip self-attentive encoders with auxiliary CTC losses. Unlike past works, we do not require convolutional frontends or interleaved recurrences to train self-attention for ASR. In Section 2, we motivate the model and relevant design choices (position, downsampling) for ASR. In Section 3, we validate SAN-CTC on the Wall Street Journal and LibriSpeech datasets by outperforming existing CTC models and most encoder-decoder models in character error rates (CERs), with fewer parameters or less training time. Finally, we train our models with different label alphabets (character, phoneme, subword), use WFST decoding to give word error rates (WERs), and examine the learned attention heads for insights.

Model architectures for CTC and ASR

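In this way, paths are analogous to framewise alignments in the HMM-NN framework. CTC models the distribution of sequences by marginalizing over all paths corresponding to an output:

$$P({\bm{y}}\mid{\mathbf{X}})=\sum_{{\bm{\uppi}}\in\mathcal{B}^{-1}({\bm{y}})}P({\bm{\uppi}}\mid{\mathbf{X}}).$$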

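Finally, CTC models each $P({\bm{\uppi}}\mid{\mathbf{X}})$ non-autoregressively, as a sequence of conditionally-independent outputs:

$$P({\bm{\uppi}}\mid{\mathbf{X}})=\prod_{t=1}^{T}P({\textnormal{\textpi}}_{t},t\mid{\mathbf{X}}).$$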

This model assumption means each $P({\textnormal{\textpi}}_{t},t\mid{\mathbf{X}})$ can be computed in parallel, after which one can do prediction via beam search, or training with gradient descent using the objective $\mathcal{L}_{\text{CTC}}({\mathbf{X}},{\bm{y}})=-\log P({\bm{y}}\mid{\mathbf{X}})$; the order-monotonicity of $\mathcal{B}$ ensures $\mathcal{L}_{\text{CTC}}$ can be efficiently evaluated with dynamic programming.
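As an illustration of the dynamic program mentioned above, the following is a minimal sketch of the standard CTC forward recursion, checked against explicit marginalization over paths. It assumes the usual definition of $\mathcal{B}$ (merge repeated labels, then remove blanks), blank index 0, and a target that is reachable within the given number of frames; the function names are hypothetical and this is not the authors' implementation.

```python
import itertools
import math

BLANK = 0  # index of the CTC blank label (assumed convention)


def collapse(path):
    """The map B: merge repeated labels, then drop blanks."""
    merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return tuple(p for p in merged if p != BLANK)


def ctc_loss(probs, target):
    """-log P(y | X) via the forward recursion.

    probs[t][c] is the framewise probability P(pi_t = c, t | X);
    target is a sequence of non-blank label indices.
    """
    # Interleave blanks around the labels: y1 y2 -> _ y1 _ y2 _
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping over a blank is allowed only between distinct labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood)


def ctc_loss_bruteforce(probs, target):
    """Same quantity, summing explicitly over every path in B^{-1}(y)."""
    T, C = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(C), repeat=T):
        if collapse(path) == tuple(target):
            total += math.prod(probs[t][c] for t, c in enumerate(path))
    return -math.log(total)
```

The brute-force version enumerates all $C^{T}$ paths and is only usable for tiny inputs; the recursion needs on the order of $T\cdot|{\bm{y}}|$ operations, which is what makes the objective practical.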

In practice, one models $P({\bm{\uppi}},{\textnormal{t}}\mid{\mathbf{X}})$ with a neural network. As inspired by HMMs, the model simplification of conditional independence can be tempered by multiple layers of (recurrent) bidirectional long short-term memory units (BLSTMs). However, these are computationally expensive (Table 1), leading to simplifications like gated recurrent units (GRUs); furthermore, the success of the $\text{ReLU}(x)=\max(0,x)$ nonlinearity in preventing vanishing gradients enabled the use of vanilla bidirectional recurrent deep neural networks (BRDNNs) to further reduce operations per layer.

Convolutions over time and/or frequency were first used as initial layers to recurrent neural models, beginning with HMM-NNs and later with CTC, where they are viewed as promoting invariance to temporal and spectral translation in ASR, or image translation in handwriting recognition; they also serve as a form of dimensionality reduction (Section 2.4). However, these networks were still bottlenecked by the sequentiality of operations at the recurrent layers, leading to the proposal of row convolutions for unidirectional RNNs, which had finite lookaheads to enable online processing while having some future context.

2.2 Motivating the self-attention layer

A layer’s structure (Figure 1b) is composed of two sublayers. The first implements self-attention, where the success of attention in CTC and encoder-decoder models is parallelized by using each position’s representation to attend to all others, giving a contextualized representation for that position. Hence, the full receptive field is immediately available at the cost of $O(T^{2})$ inner products (Table 1), enabling richer representations in fewer layers.
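To make the $O(T^{2})$ cost concrete, here is a minimal single-head sketch in plain Python. It assumes the scaled dot-product form of attention used by the Transformer; the projection names `w_q`, `w_k`, `w_v` are hypothetical, and the multi-head combination, residual connection, and second sublayer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    h: T x d_h inputs; w_q, w_k, w_v: d_h x d_k projection matrices.
    Returns (T x d_k contextualized outputs, T x T attention weights).
    """
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d_k = len(q[0])
    # T x T inner products: every position scores every other position.
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        for qi in q
    ]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights
```

Each row of `weights` is a distribution over all $T$ positions, so a single layer already sees the full sequence, whereas a convolution of fixed width would need stacking to do the same.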

We also see inspiration from convolutional blocks: residual connections, layer normalization, and tied dense layers with ReLU for representation learning. In particular, multi-head attention is akin to having a number of infinitely-wide filters whose weights adapt to the content (allowing fewer “filters” to suffice). One can also assign interpretations; for example, prior work argues that LAS self-attention heads are differentiated phoneme detectors. Further inductive biases like filter widths and causality could be expressed through time-restricted self-attention and directed self-attention, respectively.

2.3 Formulation

2.4 Downsampling

2.5 Position

Self-attention is inherently content-based, and so one often encodes position into the post-embedding vectors. We use standard trigonometric embeddings, where for $0\leq i\leq d_{\text{emb}}/2$, we define

for position $t$. We consider three approaches: content-only, which forgoes position encodings; additive, which takes $d_{\text{emb}}=d_{\text{h}}$ and adds the encoding to the embedding; and concatenative, where one takes $d_{\text{emb}}=40$ and concatenates it to the embedding. The latter was found necessary for self-attentional LAS, as additive encodings did not give convergence. However, the monotonicity of CTC is a further positional inductive bias, which may enable the success of content-only and additive encodings.
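The three schemes can be sketched as follows. The defining equation is not reproduced in this text, so the sketch assumes the Transformer convention: interleaved sine and cosine terms with base 10000, with $i$ running over $0\leq i<d_{\text{emb}}/2$ so that exactly $d_{\text{emb}}$ entries are produced. The function names and the `mode` strings are hypothetical.

```python
import math


def position_encoding(t, d_emb, base=10000.0):
    """Trigonometric encoding of position t (assumes d_emb is even)."""
    enc = []
    for i in range(d_emb // 2):
        angle = t / base ** (2 * i / d_emb)
        enc += [math.sin(angle), math.cos(angle)]
    return enc


def add_position(embeddings, mode, d_emb=40):
    """Apply one of the three schemes to a T x d_h list of embedding vectors."""
    if mode == "content-only":
        # No positional information at all.
        return [list(e) for e in embeddings]
    if mode == "additive":
        # Encoding has the same width as the embedding and is summed with it.
        d_h = len(embeddings[0])
        return [
            [x + p for x, p in zip(e, position_encoding(t, d_h))]
            for t, e in enumerate(embeddings)
        ]
    if mode == "concatenative":
        # A separate d_emb-wide encoding is appended, widening each vector.
        return [list(e) + position_encoding(t, d_emb) for t, e in enumerate(embeddings)]
    raise ValueError(f"unknown mode: {mode}")
```

Note that only the concatenative scheme changes the vector width (to $d_{\text{h}}+d_{\text{emb}}$); the other two leave it at $d_{\text{h}}$.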

Experiments

We take $(n_{\text{layers}},d_{\text{h}},n_{\text{heads}},d_{\text{ff}})=$ (10, 512, 8, 2048), giving $\sim$30M parameters. This is on par with models on WSJ (10-30M) and an order of magnitude below models on LibriSpeech (100-250M). We use MXNet for modeling and Kaldi/EESEN for data preparation and decoding. Our self-attention code is based on GluonNLP’s implementation. At train time, utterances are sorted by length: we exclude those longer than 1800 frames ($\ll$1% of each training set). We take a window of 25ms, a hop of 10ms, and concatenate cepstral mean-variance normalized features with temporal first- and second-order differences. Rescaling so that these differences also have variance $\approx 1$ helped WSJ training. We downsample by a factor of $k=3$ (this also gave an ideal $T/k\approx d_{\text{h}}$ for our data; see Table 1).
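A quick back-of-the-envelope check of these figures is sketched below. It counts only the large weight matrices of a standard Transformer-style layer (four $d_{\text{h}}\times d_{\text{h}}$ attention projections and two dense layers); biases, layer normalization, and the input and output layers are ignored, and the helper names are hypothetical.

```python
def encoder_weight_count(n_layers, d_h, d_ff):
    """Weight matrices only; biases, layer norm, and I/O layers are ignored."""
    attention = 4 * d_h * d_h      # query, key, value, and output projections
    feed_forward = 2 * d_h * d_ff  # two dense layers
    return n_layers * (attention + feed_forward)


def frames_after_downsampling(n_frames, k):
    """Sequence length after downsampling by a factor of k (rounded up)."""
    return -(-n_frames // k)


params = encoder_weight_count(n_layers=10, d_h=512, d_ff=2048)
longest = frames_after_downsampling(1800, k=3)
seconds = 1800 * 0.010  # 10 ms hop; the 25 ms window's edge effect is ignored
print(f"{params / 1e6:.1f}M weights, {longest} frames, {seconds:.0f} s")
# prints "31.5M weights, 600 frames, 18 s"
```

Under these assumptions the count lands near the quoted $\sim$30M, and the longest admitted utterance (about 18 seconds) shrinks to 600 positions, the same order as $d_{\text{h}}=512$.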

We perform Nesterov-accelerated gradient descent on batches of 20 utterances. As self-attention architectures can be unstable in early training, we clip gradients to a global norm of 1 and use the standard linear warmup period before inverse square-root decay associated with these architectures. Let $n$ denote the global step number of the batch (across epochs); the learning rate is given by

where we take $\lambda=$ 400 and $n_{\text{warmup}}$ as a hyperparameter. However, such a decay led to early stagnation in validation accuracy, so we later divide the learning rate by $10$ and run at the decayed rate for 20 epochs. We do this twice, then take the epoch with the best validation score. Xavier initialization gave validation accuracies of zero for the first few epochs, suggesting room for improvement. Like previous works on self-attention, we apply label smoothing (see Tables 2, 3, 5; we also tried model averaging to no gain). To compute word error rates (WERs), we use the dataset’s provided language model (LM) as incorporated by WFST decoding to bridge the gap between CTC and encoder-decoder frameworks, allowing comparison with known benchmarks and informing systems that incorporate expert knowledge in this way (e.g., via a pronunciation lexicon).
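The learning rate formula itself is not reproduced in this text. As a hedged sketch, the code below assumes the usual Transformer schedule scaled by $\lambda$, that is, $\frac{\lambda}{\sqrt{d_{\text{h}}}}\min\!\left(n/n_{\text{warmup}}^{1.5},\,1/\sqrt{n}\right)$; the $1/\sqrt{d_{\text{h}}}$ factor in particular is an assumption carried over from that convention, and the later divide-by-10 stages are not modeled.

```python
import math


def learning_rate(n, n_warmup, lam=400.0, d_h=512):
    """Assumed schedule: linear warmup, then inverse square-root decay.

    n is the global step (n >= 1); lam and d_h follow the values in the text.
    """
    return lam / math.sqrt(d_h) * min(n / n_warmup ** 1.5, 1.0 / math.sqrt(n))


# The two branches meet at n = n_warmup, where the rate peaks.
peak = learning_rate(8000, n_warmup=8000)
```

With this form the rate rises linearly to its peak at step $n_{\text{warmup}}$ and then falls as $1/\sqrt{n}$, so it takes four times the warmup length to halve again, which is consistent with the slow late-stage decay the text works around by dividing the rate by 10.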

We train both character- and phoneme-label systems on the 80-hour WSJ training set to validate our architectural choices. Similar to prior work, we use 40-dim. mel-scale filter banks and hence 120-dim. features. We warm up for 8000 steps, use a dropout of 0.2, and switch schedules at epoch 40. For the WSJ dataset, we compare with similar MLE-trained, end-to-end, open-vocabulary systems in Table 2. We get an eval92 CER of 4.7%, outdoing all previous CTC-like results except 4.6% with a trainable frontend. We use the provided extended 3-gram LM to retrieve WERs. For phoneme training, our labels come from the CMU pronunciation lexicon (Table 3). These models train in one day (Tesla V100), comparable to the Speech Transformer; however, SAN-CTC gives further benefits at inference time as token predictions are generated in parallel.

We also evaluate design choices in Table 4. Here, we consider the effects of downsampling and position encoding on accuracy for our fixed training regime. We see that unlike self-attentional LAS, SAN-CTC works respectably even with no position encoding; in fact, the contribution of position is relatively minor (compare with prior work, where location in an encoder-decoder system improved CER by 3% absolute). Lossy downsampling appears to preserve performance in CER while degrading WER (as information about frame transitions is lost). We believe these observations align with the monotonicity and independence assumptions of CTC.

Following prior work, we plot the standard deviation of attention weights for each head as training progresses; see Figure 2 for details. In the first layers, we similarly observe a differentiation of variances, along with wide-context heads; in later layers, unlike that work, we still see mild differentiation of variances. Also following prior work, we further plot the attention weights relative to the current time position (here, per head). Character labels gave forward- and backward-attending heads (incidentally, averaging these would retrieve the bimodal distribution reported previously) at all layers. This suggests a gradual expansion of context over depth, as is often engineered in convolutional CTC. This also suggests possibly using fewer heads, directed self-attention, and restricted contexts for faster training (Table 1). Phoneme labels gave a sharp backward-attending head and more diffuse heads. We believe this to be a symptom of English characters being more context-dependent than phonemes (for example, emitting ‘tt’ requires looking ahead, as ‘–’ must occur between two runs of ‘t’ tokens).
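One way to summarize a head's direction and spread is sketched below. It is an assumption about how such statistics could be computed, not the procedure behind Figure 2: each row of attention weights is treated as a distribution over relative offsets $j-t$, and the mean and standard deviation of that offset are pooled over all query positions. The function name is hypothetical.

```python
import math


def relative_attention_stats(weights):
    """Mean and standard deviation of the relative offset (j - t).

    weights[t][j] is the attention that position t pays to position j;
    each row is assumed to sum to 1.
    """
    T = len(weights)
    mean = sum(
        w * (j - t) for t, row in enumerate(weights) for j, w in enumerate(row)
    ) / T
    second = sum(
        w * (j - t) ** 2 for t, row in enumerate(weights) for j, w in enumerate(row)
    ) / T
    # Guard against tiny negative values from floating-point rounding.
    return mean, math.sqrt(max(0.0, second - mean ** 2))
```

Under this summary a backward-attending head has a negative mean offset, a forward-attending head a positive one, and a wide-context head a large standard deviation.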

3.2 LibriSpeech

We give the first large-scale demonstration of a fully self-attentional ASR model using the LibriSpeech ASR corpus, an English corpus produced from audio books giving 960 hours of training data. We use 13-dim. mel-frequency cepstral coefficients and hence 39-dim. features. We double the warmup period, use a dropout of 0.1, and switch schedules at epoch 30. Using character labels, we attained a test-clean CER of 2.8%, outdoing all previous end-to-end results except OCD training. We use the provided 4-gram LM via WFST to compare WERs with state-of-the-art, end-to-end, open-vocabulary systems in Table 5. At this scale, even minor label smoothing was detrimental. We run 70 epochs in slightly over a week (Tesla V100) then choose the epoch with the best validation score for testing. For comparison, the best CTC-like architecture took 4-8 weeks on 4 GPUs for its results (https://github.com/facebookresearch/wav2letter/issues/11). The Enc-Dec+CTC model is comparable, taking almost a week on an older GPU (GTX 1080 Ti) to do its $\sim$12.5 full passes over the data (https://github.com/rwth-i6/returnn-experiments/tree/master/2018-asr-attention/librispeech/full-setup-attention).

Finally, we trained the same model with BPE subwords as CTC targets, to get more context-independent units. We did 300 merge operations (10k was unstable) and attained a CER of 7.4%. This gave a WER of 8.7% with no LM (compare with Table 5’s LM-based entries), and 5.2% with a subword WFST of the LM. We still observed attention heads in both directions in the first layer, suggesting our subwords were still more context-dependent than phonemes.
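For readers unfamiliar with what a merge operation is, the following is a minimal sketch of byte-pair-encoding merge learning in its commonly described form: repeatedly fuse the most frequent adjacent symbol pair. The tooling and exact variant used for the experiments are not stated here, so the function name, tie-breaking, and the absence of end-of-word markers are all assumptions.

```python
from collections import Counter


def learn_bpe(words, n_merges):
    """Learn up to n_merges merge operations from a list of words."""
    # Each word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(n_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_vocab = Counter()
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += count
        vocab = merged_vocab
    return merges, vocab
```

Each merge adds one symbol to the label alphabet, so 300 merges keeps the CTC output layer small while 10k merges yields much longer, rarer units.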

Conclusion

We introduced SAN-CTC, a novel framework which integrates a fully self-attentional network with a connectionist temporal classification loss. We addressed the challenges of adapting self-attention to CTC and to speech recognition, showing that SAN-CTC is competitive with or outperforms existing end-to-end models on WSJ and LibriSpeech. Future avenues of work include multitasking SAN-CTC with other decoders or objectives, and streamlining network structure via directed or restricted attention.