Very Deep Self-Attention Networks for End-to-End Speech Recognition

Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, Sebastian Stüker, Alexander Waibel

Introduction

Recently, the sequence-to-sequence (S2S) approach in automatic speech recognition (ASR) has received a considerable amount of interest, due to the ability to jointly train all components towards a common goal, which reduces complexity and error propagation compared to traditional hybrid systems. Traditional systems divide representation into different levels in the acoustic model, in particular separating global features (such as channel and speaker characteristics) and local features (on the phoneme level). The language model and acoustic model are trained with different loss functions and then combined during decoding. In contrast, neural S2S models perform a direct mapping from audio signals to text sequences based on dynamic interactions between two main model components, an encoder and a decoder, which are jointly trained towards maximizing the likelihood of the generated output sequence. The neural encoder encodes the audio features into high-level representations, which are then fed into an auto-regressive decoder that attentively generates the output sequences.

In this context, we aim to reconsider acoustic modeling within end-to-end models. Previous approaches generally used long short-term memory (LSTM) networks or time-delay neural networks (TDNNs) operating on top of frame-level features to learn sequence-level representations. These neural networks are able to capture long-range and local dependencies between different timesteps.

Recently, self-attention has been shown to efficiently represent different structures including text, images, and even acoustic signals with impressive results. The Transformer model using self-attention achieved the state of the art in mainstream NLP tasks. The attractiveness of self-attention networks originates from the ability to establish a direct connection between any two elements in the sequence. Self-attention is able to scale with the length of the input sequence without any limiting factor such as the kernel size of CNNs or the vanishing gradient problem of LSTMs. Moreover, the self-attention network is also computationally advantageous compared to recurrent structures because intermediate states are no longer connected recurrently, leading to more efficient batched operations. As a result, self-attention networks can be reasonably trained with many layers, leading to state-of-the-art performance in various tasks. Self-attention and the Transformer have been exploratorily applied to ASR, but so far with unsatisfactory results. One study found that self-attention in the encoder (acoustic model) was not effective on its own, but that combining it with an LSTM brought marginal improvement and greater interpretability, while another did not find any notable improvement using a Transformer in which the encoder combines self-attention with convolution/LSTM compared to other model architectures. In this work, we show that the Transformer requires little modification to adapt to the speech recognition task. Specifically, we exploit the advantages of self-attention networks for ASR such that both our acoustic encoder and character-generating decoder are constructed without any recurrence or convolution. To the best of our knowledge, this is the first attempt to propose this system architecture, and we show that a competitive end-to-end ASR model can be achieved solely using standard training techniques from general S2S systems.

Our contributions are as follows. First, we show that depth is an important factor in obtaining competitive end-to-end ASR models with the Transformer. Second, in order to facilitate training of very deep configurations, we propose a variation of stochastic depth for the Transformer inspired by the Stochastic Residual Network for image classification.

We discovered that its ability to regularize is key to obtaining the state-of-the-art result among end-to-end ASR models on the standard 300h Switchboard (SWB) benchmark. This result is achieved using a total of 48 Transformer layers across the encoder and decoder. Our source code and final model are available at https://github.com/quanpn90/NMTGMinor/tree/audio-encoder/

Model Description

The main components of the model are an encoder, which consumes the source sequence and generates a high-level representation, and a decoder, which generates the target sequence. The decoder models the data as a conditional language model: the probability of the sequence of discrete tokens is decomposed into an ordered product of distributions conditioned on both the previously generated tokens and the encoder representation.
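Written out, with $X$ denoting the audio input, $y_1, \dots, y_T$ the output tokens and $\mathrm{Enc}(X)$ the encoder representation (symbols introduced here only for illustration), the decomposition described above takes the usual auto-regressive form:

```latex
p(y_1, \dots, y_T \mid X) = \prod_{t=1}^{T} p\left(y_t \mid y_1, \dots, y_{t-1}, \mathrm{Enc}(X)\right)
```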

Both encoder and decoder are neural networks and require neural components that are able to learn the relationship between the time steps in the input and output sequences. The decoder also requires a mechanism to condition on specific components of the encoder representation. For the Transformer, attention, or its common variant multi-head attention, is the core of the model in place of recurrence.

Multi-Head Attention

Fundamentally, attention refers to the method of using a content-based information extractor from a set of queries $Q$, keys $K$ and values $V$. The retrieval function is based on similarities between the queries and the keys, and in turn returns the weighted sum of the values as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{\top})V$$

Recently, dot-product attention was improved by scaling the queries beforehand and introducing sub-space projections for keys, queries and values into $n$ parallel heads, in which $n$ attention operations are performed with the corresponding heads. The result is the concatenation of the attention outputs from each head. Notably, unlike recurrent connections, which use single states with a gating mechanism to transfer data, or convolutional connections, which linearly combine local states limited to a kernel size, self-attention aggregates the information in all time-steps without any intermediate transformation.
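The retrieval and head-splitting steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the learned projection matrices for queries, keys and values are omitted (each head simply takes a contiguous slice of the feature dimension), and the scaling factor $1/\sqrt{d}$ follows the standard scaled dot-product formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity between this query and every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the values.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


def split_head(rows, head, size):
    """Slice out the features belonging to one head (stands in for a learned projection)."""
    return [row[head * size:(head + 1) * size] for row in rows]


def multi_head_attention(queries, keys, values, n_heads):
    """Run attention per head, then concatenate the head outputs per time step."""
    size = len(queries[0]) // n_heads  # assumes the dimension divides evenly
    heads = [
        attention(
            split_head(queries, h, size),
            split_head(keys, h, size),
            split_head(values, h, size),
        )
        for h in range(n_heads)
    ]
    return [[x for head in heads for x in head[t]] for t in range(len(queries))]
```

For self-attention, the same sequence is passed as queries, keys and values, so every time step is connected directly to every other one.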

Layer Architecture

The overall architecture is shown in Figure 1. The encoder and decoder of the Transformer are constructed from layers, each of which contains self-attention sub-layers coupled with feed-forward neural networks.

To adapt the encoder to long speech utterances, we follow the reshaping practice from prior work by grouping consecutive frames into one step. Subsequently, we combine the input features with sinusoidal positional encoding. While directly adding acoustic features to the positional encoding is harmful, potentially leading to divergence during training, we resolved that problem by simply projecting the concatenated features to a higher dimension before adding ($512$, as for the other hidden layers in the model). In the case of speech recognition specifically, the positional encoding offers a clear advantage compared to learnable positional embeddings, because speech signals can be arbitrarily long, with a higher variance in length than text sequences.
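As an illustration of the step just described, the sketch below builds sinusoidal positional encodings and adds them to features that have already been projected. It assumes the standard sinusoidal formulation from the original Transformer (sine on even dimensions, cosine on odd ones, base 10000); the function names are hypothetical and the learned projection to $512$ dimensions is taken as given.

```python
import math


def positional_encoding(n_steps, d_model):
    """Return an n_steps x d_model table of sinusoidal position features."""
    table = []
    for pos in range(n_steps):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(projected):
    """Add positional encodings to already projected feature vectors."""
    pe = positional_encoding(len(projected), len(projected[0]))
    return [
        [f + p for f, p in zip(frame, pos)]
        for frame, pos in zip(projected, pe)
    ]
```

Because the table is computed rather than learned, it can be generated for an utterance of any length, including lengths never seen in training.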

The Transformer encoder passes the input features to a self-attention layer followed by a feed-forward neural network with one hidden layer and the ReLU activation function. For each of these sub-modules, we follow the original work and include residual connections, which establish shortcuts between the lower-level representation and the higher layers. The presence of the residual connections massively increases the magnitude of the neuron values, which is then alleviated by the layer-normalization layers placed after each residual connection.

The decoder is the standard Transformer decoder used in recent translation systems. The notable difference between the decoder and the encoder is that, to maintain the auto-regressive nature of the model, the self-attention layer of the decoder must be masked so that each state has access only to the past states. Moreover, an additional attention layer using the target hidden layers as queries and the encoder outputs as keys and values is placed between the self-attention and the feed-forward layers. Residual connections and layer normalization are set up identically to the encoder.
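The masking constraint can be sketched as follows. This is a simplified illustration with a hypothetical function name: the decoder states are used directly as queries, keys and values (no learned projections, a single head), and masking is implemented by restricting each position to the prefix it is allowed to see.

```python
import math


def causal_self_attention(states):
    """Self-attention in which position t attends only to positions 0..t."""
    scale = 1.0 / math.sqrt(len(states[0]))
    outputs = []
    for t, query in enumerate(states):
        visible = states[: t + 1]  # future positions are masked out
        scores = [
            scale * sum(a * b for a, b in zip(query, key)) for key in visible
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, visible))
                for j in range(len(query))
            ]
        )
    return outputs
```

Changing a later state leaves all earlier outputs untouched, which is the property that keeps the decoder auto-regressive.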

This particular design of the Transformer has various advantages compared to previously proposed RNN and CNN networks. First, the computation of each layer and sub-module can be efficiently parallelized over both the mini-batch and time dimensions of the input. Second, the combination of residual connections and layer normalization is the key to making deeper configurations trainable, which is the main reason for the performance breakthrough in recent works in both MT and natural language processing.

Stochastic Layers

The high density of residual connections is the reason why the Transformer can be favourably trained with many layers. However, deep models in general suffer from either overfitting, due to more complex architectures, or optimization difficulty. Studies of residual networks have shown that during training the network consists of multiple sub-networks taking different paths through shortcut connections, and thus there are redundant layers. Motivated by the Stochastic Residual Network, we propose to apply stochastic residual layers in our Transformers. The method resembles Dropout, with the key idea being that layers are randomly dropped during training. The original residual connection of an input $x$ and its corresponding neural layer $F$ has the following form:

$$R(x) = \mathrm{LayerNorm}(F(x) + x)$$

In equation 3, the inner function $F$ is either self-attention, feed-forward layers or even decoder-encoder attention. The layer normalization keeps the magnitude of the hidden layers from growing large. Stochastic residual connections fundamentally apply a mask $M$ on the function $F$, as follows:

$$R(x) = \mathrm{LayerNorm}(M \cdot F(x) + x)$$

Mask $M$ only takes $0$ or $1$ as values, generated from a Bernoulli distribution similar to dropout. When $M=1$, the inner function $F$ is activated, while it is skipped when $M=0$. These stochastic connections enable more sub-network configurations to be created during training, while during inference the full network is present, which has the effect of ensembling different sub-networks, as analyzed in prior work. It is non-trivial to decide how to set the parameter $p$ for dropping layers, since the number of residual connections in the Transformer is considerable. In general, the lower the layer is, the lower the probability $p$ needs to be. As a result, $p$ values are set with the following policy:

Sub-layers inside each encoder or decoder layer share the same mask, so each mask decides to drop or to keep the whole layer (including the sub-layers inside). This way we have one hyper-parameter $p$ for each layer.

As suggested by prior work, the lower layers of the networks handle raw-level acoustic features on the encoder side, and the character embeddings on the decoder side. Therefore, lower layers $l$ have lower drop probabilities, linearly scaled by their depth according to equation 4, where $p$ is the global-level parameter and $L$ is the total number of layers. Our early experiments with a constant $p$ for all connections provide evidence that dropping lower-level representations is less tolerable than dropping higher-level representations.

Lastly, since the layers are selected with probability $1-p_{l}$ during training and are always present during inference, we scale the layers’ output by $\frac{1}{1-p_{l}}$ whenever they are not skipped. Therefore, each stochastic residual connection has the following form during training (the scalar is removed during testing):

$$R(x) = \mathrm{LayerNorm}\left(M \cdot F(x) \cdot \frac{1}{1-p_{l}} + x\right)$$
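The pieces of this section can be put together in a short sketch. It is an illustration under stated assumptions rather than the authors' code: all function names are hypothetical, the layer normalization has no learned gain or bias, and the linear schedule `p * layer / n_layers` is only one plausible reading of "linearly scaled by depth", since the exact policy is not reproduced here. The mask is passed in explicitly so that one draw per layer can be shared by all of its sub-layers.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def drop_probability(layer, n_layers, p):
    """Assumed linear schedule: lower layers are dropped less often than higher ones."""
    return p * layer / n_layers


def sample_mask(p_l, rng):
    """Bernoulli mask M: 1 with probability 1 - p_l, otherwise 0."""
    return 1 if rng.random() >= p_l else 0


def stochastic_residual(x, fn, p_l, mask, training):
    """LayerNorm(M * F(x) / (1 - p_l) + x) in training, LayerNorm(F(x) + x) at test time."""
    if training:
        if mask == 0:
            # The inner function is skipped entirely; only the shortcut remains.
            return layer_norm(x)
        fx = [v / (1.0 - p_l) for v in fn(x)]
    else:
        fx = fn(x)
    return layer_norm([f + v for f, v in zip(fx, x)])
```

A training step would draw `mask = sample_mask(p_l, rng)` once per layer with, for example, `rng = random.Random(0)`, and reuse it for the self-attention, encoder-decoder attention and feed-forward sub-layers of that layer.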

Experimental Setup

Our experiments were conducted on the Switchboard-1 Release 2 (LDC97S62) corpus, which contains over 300 hours of speech. The Hub5’00 evaluation data (LDC2002S09) was used as our test set. All the models were trained on 40 log mel filter-bank features, which are extracted and normalized per conversation. We also adopted a simple down-sampling method in which we stacked 4 consecutive feature vectors to reduce the length of the input sequences by a factor of 4. Besides the filter-bank features, we did not employ any auxiliary features. We followed the approach from prior work to generate a speech perturbation training set. Extra experiments were also conducted on the TED-LIUM 3 dataset, which is more challenging due to longer sequences.
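The down-sampling step can be sketched as below. The function name is hypothetical, and zero-padding the final group when the utterance length is not a multiple of 4 is an assumption; the text does not say how leftover frames are handled.

```python
def stack_frames(frames, factor=4):
    """Concatenate every `factor` consecutive feature vectors into one step."""
    dim = len(frames[0])
    padded = list(frames)
    # Assumption: zero-pad so the number of frames is a multiple of `factor`.
    while len(padded) % factor != 0:
        padded.append([0.0] * dim)
    return [
        [value for frame in padded[i:i + factor] for value in frame]
        for i in range(0, len(padded), factor)
    ]
```

With 40-dimensional filter-bank features, each stacked step has 160 dimensions and the sequence is roughly a quarter of its original length.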

Implementation Details

Our hyperparameter search revolves around the Base configuration of the machine translation model in the original Transformer paper. For all of our experiments in this work, the embedding dimension $d$ is set to $512$ and the size of the hidden state in the feed-forward sub-layer is $1024$. The mini-batch size is set so that we can fit our model in the GPU, and we accumulate the gradients and update every $25000$ characters. We use Adam with a learning rate that adapts over the training progress:

in which init_lr is set to $2$, and we warm up the learning rate for $8000$ steps. Dropout (applied before the residual connection and on the attention weights) is set to $0.2$. We also apply character dropout with $p=0.1$ and label smoothing with $\epsilon=0.1$.
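A plausible form of this schedule, assuming it follows the warm-up rule of the original Transformer paper multiplied by init_lr, is sketched below. The function name and the exact formula are assumptions; only the values of init_lr, the warm-up length and $d$ come from the text.

```python
def learning_rate(step, init_lr=2.0, d_model=512, warmup=8000):
    """Linear warm-up for `warmup` steps, then inverse square-root decay."""
    step = max(step, 1)  # avoid raising zero to a negative power
    return init_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

Under this assumption the learning rate rises linearly, peaks at step $8000$ at a value just under $10^{-3}$, and then decays with the inverse square root of the step count.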

Results

The experiment results on the SWB test sets are shown in Table 1. A shallow configuration (i.e., $4$ layers) is not sufficient for the task, and the WER decreases from $20.8\%$ to $12.1\%$ on the SWB test set as we increase the depth from $4$ to $24$. The improvement is less significant between $12$ and $24$ layers (only $5\%$ relative WER), which seems to be a symptom of overfitting.

Our suspicion of overfitting is confirmed by the addition of stochastic networks. At $12$ layers, the stochastic connections only improve the CH performance by a small margin, while the improvement is substantially greater in the $24$-layer setting. Following this trend, the stochastic $48$-layer model keeps improving on the CH test set, showing the model’s ability to generalize better.

Arguably, the advantage of deeper models is simply that they offer more parameters, as shown in the second column. We performed a contrastive experiment using a shallow model of 8 layers, but doubling the model size so that its parameter count is larger than that of the deep $24$-layer model. The performance of this model is considerably worse than that of the $24$-layer model, demonstrating that deeper networks with a smaller size are more beneficial than a wider yet shallower configuration. Conversely, we found that the 48-layer model with half the size is equivalent to the $12$-layer variant, possibly due to over-regularization. We did not change the dropout values for this model, so each layer’s hidden units are dropped more severely.

Our second discovery is that the encoder requires deeper networks than the decoder for the Transformer. This is in line with previous work that increases depth for the CNN encoder. As shown above, the encoder has to learn representations starting from audio features, while the decoder handles the generation of character sequences, conditioned on the encoder representation. The difference in modalities suggests different configurations. Holding the total number of layers at $48$, we shift depth to the encoder. Our result with a much shallower decoder of only $8$ layers, but with $40$ encoder layers, is as good as the $24$-layer configuration. More strikingly, we were able to obtain our best result with the $36-12$ configuration with $20.6\%$ WER, which is competitive with the best previous end-to-end work using data augmentation.

Third, the combination of our regularization techniques (dropout, label smoothing and stochastic networks) is additive with data augmentation, which further improved our result to $18.1\%$ with the $36-12$ setup. This model, as far as we know, establishes the state-of-the-art result for the SWB benchmark among end-to-end ASR models, as shown in Table 2. Compared to the best hybrid models with similar data constraints, our models outperform them on the CH test set while remaining competitive on the SWB test set without any additional language model training data. This result suggests the strong generalizability of the Stochastic Transformer.

Finally, the experiments with similar depth suggest that self-attention performs competitively compared to LSTMs or TDNNs. The former benefits strongly from building deep residual networks, and our main finding is that depth is crucial for using self-attention in the regime of ASR.

Table 3 shows our result on the TED-LIUM (version 3) dataset. With a configuration similar to the SWB models, our model with 36 encoder layers and 12 decoder layers outperformed a strong baseline that uses both speed perturbation and an external language model trained on more data than the available transcriptions. This result continues the trend that these models benefit from a deeper encoder, and together with the stochastic residual connections we further improved WER by $21.8\%$ relative, from $14.2\%$ to $11.1\%$. We did not have enough time for a thorough hyper-parameter search by the time of submission; given the potential of the models, better results can likely be obtained by further hyper-parameter optimization.
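The relative improvement quoted above can be checked with a one-line computation; the helper name is hypothetical and the two WER values are the ones given in the text.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


print(f"{100 * relative_reduction(14.2, 11.1):.1f}%")  # prints "21.8%"
```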

Related Work

The idea of using self-attention as the main component of ASR models has been investigated in various forms. One approach combines self-attention with LSTMs, while another uses self-attention as an alternative in CTC models. A variation of the Transformer has been applied to ASR with additional TDNN layers to downsample the acoustic signal. Though self-attention has provided various benefits such as training speed or model interpretability, previous works have not been able to show any improvement in terms of performance. Our work provides a self-attention-only model and shows that with high capacity and regularization, such a network can exceed previous end-to-end models and approach the performance of hybrid systems.

Conclusion

Directly mapping from acoustics to text transcriptions is a challenging task for the S2S model. Theoretically, self-attention can be used as an alternative to TDNNs or LSTMs for acoustic modeling, and here we are the first to demonstrate that the Transformer can be effective for ASR, the key being to set up very deep stochastic models. State-of-the-art results among end-to-end models on $2$ standard benchmarks are achieved, and our networks are among the deepest configurations for ASR. Future work will involve developing the framework under more realistic and challenging conditions such as real-time recognition, in which latency and streaming are crucial.

Acknowledgements

The work leading to these results has received funding from the European Union under grant agreement No 825460 and the Federal Ministry of Education and Research (Germany) / DLR Projektträger Bereich Gesundheit under grant agreement No 01EF1803B. We are also grateful for the very useful comments from Elizabeth Salesky.