Recent Developments on ESPnet Toolkit Boosted by Conformer

Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

eess.AS cs.SD

Introduction

The Transformer architecture has drawn immense interest recently and has become the dominant model due to its effectiveness across various sequence-to-sequence tasks, such as machine translation, language modeling (LM), and automatic speech recognition (ASR). One reason for the success of the Transformer is that its multihead self-attention layers learn long-range global context better than recurrent neural networks (RNNs). For speech processing tasks, however, local information is as crucial as global context for capturing particular properties of speech, such as coarticulation and monotonicity. Convolutional neural networks (CNNs), on the other hand, are good at extracting fine-grained local feature patterns. Recently, Gulati et al. proposed Conformer, a novel architecture that combines self-attention and convolution in ASR models. With this design, the self-attention layer learns the global context while the convolution module efficiently captures the local correlations.

In addition to ASR, other speech processing tasks can also benefit from local information. In this study, we explore the effectiveness of Conformer on various end-to-end speech processing applications, including automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). We provide extensive comparisons of Conformer with Transformer on many publicly available corpora and share practical guidance (e.g., learning rate, hyper-parameters, network structure) on the use of Conformer. We are also preparing to release reproducible recipes and state-of-the-art setups so that the community can build on our results. Our contributions are as follows:

We extend the Conformer architecture to various end-to-end speech processing applications and conduct comparative experiments with Transformer.

We share practical guidance for training Conformer, such as the learning rate, the kernel size of the Conformer block, and model architectures.

We provide reproducible benchmark results, recipes, setups, and well-trained models on a large number of publicly available corpora in our open-source toolkit ESPnet. (Due to the page limitation, we are not able to cite all references. Instead, the corresponding links are embedded in the corpora names.)

CONFORMER

Our Conformer model consists of the Conformer encoder proposed by Gulati et al. and a Transformer decoder. The encoder is a multi-block architecture in which each block stacks a positionwise feed-forward (FFN) module, a multihead self-attention (MHSA) module, a convolution (CONV) module, and a second FFN module at the end. We apply layer normalization (LN) before each module, and dropout followed by a residual connection afterward (pre-norm). This section describes the details of each module in the encoder.

The idea of the MHSA module is to learn an alignment in which each token in the sequence learns to gather information from the other tokens. For each head $h$, the output of the attention computation can be formulated as:
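In the standard scaled dot-product form, a sketch of this computation is given below. The symbols $\mathbf{Q}_h$, $\mathbf{K}_h$, $\mathbf{V}_h$ (queries, keys, and values of head $h$), $d_k$ (per-head key dimension), and $\mathbf{W}^{O}$ (output projection) are notation assumed here for illustration, not taken from the text above.

```latex
\begin{align}
\mathrm{Att}_h(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h)
  &= \mathrm{Softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_h, \\
\mathrm{MHSA}(\mathbf{Q}, \mathbf{K}, \mathbf{V})
  &= \mathrm{Concat}\left(\mathrm{Att}_1, \dots, \mathrm{Att}_H\right)\mathbf{W}^{O}.
\end{align}
```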

In addition, Conformer integrates the relative positional encoding scheme from Transformer-XL, which generates better position information for input sequences of varying lengths. For an input sequence $\mathbf{X}$, the computational procedure can be summarized as:

2 Convolution module

Figure 1 illustrates the details of the CONV module. The CONV module starts with a 1-D pointwise convolution layer and a gated linear unit (GLU) activation. The 1-D pointwise convolution layer doubles the number of input channels, while the GLU activation splits its input along the channel dimension and takes an element-wise product of the two halves. This is followed by a 1-D depthwise convolution layer, a batch normalization (BN) layer, a Swish activation, and another 1-D pointwise convolution layer.
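The gating and depthwise steps can be made concrete with a small pure-Python sketch. It is illustrative only: the function names are hypothetical, the zero "same" padding in the depthwise convolution is an assumption, and the pointwise convolutions and batch normalization are omitted for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu(channels):
    """GLU on one frame: split the channels in half and gate the
    first half with the sigmoid of the second half."""
    half = len(channels) // 2
    values, gates = channels[:half], channels[half:]
    return [v * sigmoid(g) for v, g in zip(values, gates)]


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


def depthwise_conv1d(frames, kernels):
    """Depthwise 1-D convolution over time.

    frames:  list of T frames, each a list of C channel values.
    kernels: list of C kernels (one per channel), each of odd length.
    Each channel is filtered only by its own kernel; zero padding keeps
    the output length equal to T (an assumption of this sketch).
    """
    num_frames = len(frames)
    out = []
    for t in range(num_frames):
        row = []
        for c, kernel in enumerate(kernels):
            pad = len(kernel) // 2
            acc = 0.0
            for j, w in enumerate(kernel):
                s = t + j - pad
                if 0 <= s < num_frames:
                    acc += w * frames[s][c]
            row.append(acc)
        out.append(row)
    return out
```

Because the pointwise layer doubles the channels before the GLU halves them again, the channel count entering the depthwise layer matches the module input.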

3 Pointwise feed-forward module

The FFN module in the original Transformer is composed of two linear transformations with a ReLU activation in between, as follows:
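In the usual notation this can be written as below, where the weight matrices $\mathbf{W}_1$, $\mathbf{W}_2$ and the biases $\mathbf{b}_1$, $\mathbf{b}_2$ are symbols assumed here for illustration:

```latex
\begin{equation}
\mathrm{FFN}(\mathbf{X}) = \max\left(0,\; \mathbf{X}\mathbf{W}_1 + \mathbf{b}_1\right)\mathbf{W}_2 + \mathbf{b}_2 .
\end{equation}
```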

Unlike Transformer, Conformer introduces a second FFN module and replaces the ReLU activation with the Swish activation. In addition, inspired by Macaron-Net, the two FFN modules follow a half-step scheme and sandwich the MHSA and CONV modules. Mathematically, for input $\mathbf{X}$, the output is:
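A sketch of this composition, following the half-step residual form of the Conformer block proposed by Gulati et al., is given below. The intermediate symbols $\hat{\mathbf{X}}$, $\mathbf{X}'$, $\mathbf{X}''$ and the output symbol $\mathbf{Y}$ are notation assumed here for illustration.

```latex
\begin{align}
\hat{\mathbf{X}} &= \mathbf{X} + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{X}), \\
\mathbf{X}' &= \hat{\mathbf{X}} + \mathrm{MHSA}(\hat{\mathbf{X}}), \\
\mathbf{X}'' &= \mathbf{X}' + \mathrm{CONV}(\mathbf{X}'), \\
\mathbf{Y} &= \mathrm{LN}\!\left(\mathbf{X}'' + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{X}'')\right).
\end{align}
```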

4 Conformer block

Figure 2 shows how the modules are combined. The differences between the Conformer block and the Transformer block include the relative positional encoding, the integrated CONV module, and the pair of FFN modules in the Macaron-Net style.

SPEECH APPLICATIONS

In ASR tasks, the Conformer model predicts a target sequence $Y$ of characters or byte-pair-encoding (BPE) tokens (generated with the SentencePiece toolkit) from an input sequence $\mathbf{X}$ of 80-dimensional log-mel filterbank features, with or without 3-dimensional pitch features. $\mathbf{X}$ is first subsampled by a factor of 4 in a convolutional layer and then fed into the encoder and decoder to compute the cross-entropy (CE) loss. The encoder output is also used to compute a connectionist temporal classification (CTC) loss for joint CTC-attention training and decoding. During inference, a token-level or word-level language model (LM) is combined via shallow fusion.
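As a minimal sketch of how the two objectives and the LM are typically combined, the functions below interpolate the CTC and attention terms and then add a weighted LM log-probability. The function names and the default weights (0.3 and 0.5) are assumptions for illustration, not values reported here.

```python
def joint_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Training objective: weighted sum of the CTC loss and the
    attention (cross-entropy) loss. ctc_weight is an assumed value."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def fused_score(att_logp, ctc_logp, lm_logp, ctc_weight=0.3, lm_weight=0.5):
    """Decoding score of one hypothesis: joint CTC-attention
    log-probability plus a weighted LM log-probability (shallow fusion).
    Both weights are assumed values."""
    asr_logp = ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp
    return asr_logp + lm_weight * lm_logp


def best_hypothesis(hypotheses, ctc_weight=0.3, lm_weight=0.5):
    """Pick the highest-scoring hypothesis from a list of
    (text, att_logp, ctc_logp, lm_logp) tuples."""
    return max(
        hypotheses,
        key=lambda h: fused_score(h[1], h[2], h[3], ctc_weight, lm_weight),
    )[0]
```

In practice the weights would be tuned on a development set, in the same way as the other hyper-parameters.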

ST tasks adopt the same framework as ASR: the model directly maps speech in a source language to the corresponding translation in the target language. To avoid a serious under-fitting problem, we initialize the ST encoder with a pre-trained ASR encoder and the ST decoder with a pre-trained machine translation (MT) decoder.

For the SS tasks, the Conformer model is optimized to estimate a time-frequency mask for each individual speaker given a speech mixture. The model is trained with the utterance-level permutation invariant training (uPIT) loss. Unlike the ASR system, the Conformer model here contains only the encoder, followed by an additional linear layer and an activation function to predict the masks.
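The permutation-invariant objective can be sketched as follows: the loss is evaluated for every assignment of estimated sources to reference speakers over the whole utterance, and the smallest value is kept. This sketch assumes a plain mean squared error on flattened per-speaker signals; the function names are hypothetical, and the mask-based target used in the actual experiments is not modelled.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(estimates, references):
    """Utterance-level permutation invariant loss.

    estimates, references: lists with one flattened utterance-level
    signal per speaker. Returns (loss, permutation), where
    permutation[i] is the reference index assigned to estimate i.
    """
    num_speakers = len(references)
    best_loss, best_perm = None, None
    for perm in permutations(range(num_speakers)):
        loss = sum(
            mse(estimates[i], references[p]) for i, p in enumerate(perm)
        ) / num_speakers
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

The exhaustive search costs a factorial number of evaluations in the number of speakers, which is negligible for two-speaker mixtures.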

TTS tasks use the Conformer encoder in non-autoregressive TTS models, which generate a sequence of log-mel filterbank features from a phoneme or character sequence in cooperation with a duration predictor. The whole model is optimized to minimize the L1 loss for the target features and the mean squared error (MSE) loss for the durations.
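A minimal sketch of this objective is shown below, together with a FastSpeech-style length regulator that expands encoder states by their durations. The equal weighting of the two loss terms, the function names, and the use of a length regulator are assumptions for illustration.

```python
def l1_loss(pred, target):
    """Mean absolute error over all frames and feature dimensions."""
    diffs = [abs(p - t) for pf, tf in zip(pred, target) for p, t in zip(pf, tf)]
    return sum(diffs) / len(diffs)


def mse_loss(pred, target):
    """Mean squared error over a flat sequence of values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def tts_loss(pred_feats, target_feats, pred_durs, target_durs):
    """L1 loss on the acoustic features plus MSE loss on the durations.
    Equal weighting of the two terms is an assumption of this sketch."""
    return l1_loss(pred_feats, target_feats) + mse_loss(pred_durs, target_durs)


def length_regulate(states, durations):
    """Repeat each encoder state according to its integer duration so
    that the output has one entry per acoustic frame."""
    frames = []
    for state, dur in zip(states, durations):
        frames.extend([state] * dur)
    return frames
```

For example, `length_regulate(["a", "b", "c"], [2, 0, 1])` yields three frames, with the zero-duration token dropped.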

SPEECH RECOGNITION EXPERIMENTS

To evaluate the effectiveness of our Conformer model, we conduct experiments on a total of 25 ASR corpora, covering various recording environments (clean, noisy, far-field, mixed speech), languages (English, Mandarin, Japanese, Spanish, low-resource languages), and sizes (10-960 hours). Most of the corpora follow the same data preparation procedure as in Kaldi. For some corpora, we optionally also use speed perturbation with ratios of 0.9, 1.0, and 1.1 and SpecAugment for data augmentation.

For each corpus, the detailed configurations of our Conformer model are the same as those of the ESPnet Transformer recipes ( $\text{Enc}=12,\text{Dec}=6,d^{\text{ff}}=2048,H=4,d^{\text{att}}=256$ ). The exception is Librispeech, for which the number of attention heads and the attention dimension differ ( $H=8,d^{\text{att}}=512$ ). The convolutional subsampling layer is a 2-layer CNN with 256 channels, a stride of 2, and a kernel size of 3. Depending on the corpus, we train for 20-100 epochs and average the last 10 best checkpoints as the final model. We tune the learning rate coefficient (e.g., 1-10) and the kernel size of the CONV module (e.g., 5-31) on the corresponding development sets to obtain the best results. Detailed setups can be found in the ESPnet recipes (https://github.com/espnet/espnet).
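The effect of the subsampling layer on sequence length follows from the usual convolution arithmetic. The sketch below assumes no padding, which is not stated in the text; the function names are hypothetical.

```python
def conv_out_len(length, kernel=3, stride=2, padding=0):
    """Output length of one convolution along the time axis."""
    if length + 2 * padding < kernel:
        raise ValueError("input is shorter than the kernel")
    return (length + 2 * padding - kernel) // stride + 1


def subsampled_len(num_frames, num_layers=2):
    """Number of frames left after the stacked stride-2 layers."""
    n = num_frames
    for _ in range(num_layers):
        n = conv_out_len(n)
    return n
```

Under these assumptions a 100-frame input leaves 24 frames, close to the nominal factor of 4.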

2 Results

Table 1 shows the character and word error rate (CER/WER) results on each corpus. The Conformer model outperforms Transformer on 14/17 corpora in our experiments and even achieves state-of-the-art results on several corpora, such as AIDATATANG and AISHELL-1. Beyond single-speaker speech, it also brings about a 7% relative improvement over Transformer on the multi-speaker WSJ-2mix data. We also conduct experiments to investigate how well Conformer models generalize to low-resource language corpora, as shown in Table 2. Conformer achieves more than 15% relative improvement over the Transformer model in all 8 languages.
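For reference, the error rates and relative improvements quoted here can be computed as in the following sketch, which uses a standard Levenshtein distance. The function names are hypothetical, and counting spaces as characters in the character-level case is an assumption; scoring tools differ on such details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """CER (unit='char') or WER (unit='word') in percent over a corpus."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total


def relative_improvement(baseline, new):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - new) / baseline
```

As an illustration with made-up numbers, lowering an error rate from 10.0 to 9.3 corresponds to a relative improvement of about 7%.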

Since our Conformer model uses the same decoder framework as Transformer, the performance gains may come from the additional local information provided by the CONV module. We therefore study the effect of the CONV module by training a pure CTC model or a Transducer model with the Conformer encoder. Table 3 summarizes the CER/WER results of two pure CTC models, while Table 4 shows the CER results of different Transducer models. We use a single-layer LSTM decoder in all Transducer models; detailed setups can be found in the ESPnet recipes. Both the Conformer-CTC and TDNN-Conformer-Transducer models show consistent improvements, and the Conformer-CTC model even achieves results competitive with the Transformer with a decoder. From the above results, we can conclude that Conformer shows superior performance on various types of ASR corpora, even in the challenging far-field, mixed-speech, and low-resource language scenarios.

3 Discussion

The following are some training tips from our experiments:

When Conformer training suffers a sudden accuracy drop on the training set, decreasing the learning rate can lead to more stable training. We use a learning rate coefficient in {1, 2, 5, 10} for different corpora.

The kernel size of the CONV module is related to the input sentence length in the corpora. We use a kernel size in {5, 7, 15, 31} for different corpora.

In addition to the warmup training strategy, the OneCycleLR learning rate scheduler can also stabilize the training of self-attention-based models.
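To illustrate how a learning rate coefficient interacts with warmup, the sketch below implements the inverse-square-root warmup schedule commonly used for Transformer models. Treating the coefficient as a multiplicative factor of this schedule, and the default of 25000 warmup steps, are assumptions of this sketch rather than settings stated above.

```python
def warmup_lr(step, coefficient=1.0, d_att=256, warmup_steps=25000):
    """Learning rate at a given step (1-based).

    The rate grows linearly for warmup_steps updates, then decays
    proportionally to the inverse square root of the step. The
    coefficient scales the whole curve.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return coefficient * d_att ** -0.5 * min(
        step ** -0.5, step * warmup_steps ** -1.5
    )
```

Lowering the coefficient from 10 to 1 reduces the peak rate tenfold without changing the shape of the curve, which is one way to act on the first tip.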

SPEECH TRANSLATION EXPERIMENTS

The configurations of all our ST models are the same as those of the ASR systems described in Sec. 4.1. During training, we initialize the model parameters with the pre-trained encoder and decoder optimized on the ASR and MT parallel data of each corpus, respectively. We conduct the ST experiments on the Fisher-CallHome Spanish corpus and evaluate on five common test sets. We use the Fisher-dev set as the development set. The input speech features are the same as in the ASR system, and the output vocabulary consists of 1k BPE tokens. The same data augmentation techniques are used to improve the performance.

The Conformer model achieves about a 10% relative improvement over the baseline Transformer model in the ST task as well. To verify that the gains do not come merely from the extra parameters of the additional CONV and FFN modules, we also train a Conformer-small model, decreasing $d^{\text{ff}}$ from 2048 to 1024 to keep a comparable parameter budget for a fair comparison. Although halving $d^{\text{ff}}$ slightly lowers the BLEU score, our Conformer-small model still significantly outperforms the Transformer model.
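The rationale for halving $d^{\text{ff}}$ can be checked with simple parameter arithmetic. The sketch below counts only the FFN parameters per encoder block, using $d^{\text{att}}=256$ from the ASR configuration; the function name is hypothetical, and the parameters of the CONV module are not counted.

```python
def ffn_params(d_att, d_ff):
    """Weights and biases of one FFN module (two linear layers)."""
    return (d_att * d_ff + d_ff) + (d_ff * d_att + d_att)


# One FFN per Transformer block versus two per Conformer block.
transformer_ffn = ffn_params(256, 2048)
conformer_ffn = 2 * ffn_params(256, 2048)
conformer_small_ffn = 2 * ffn_params(256, 1024)
```

With these settings the two half-width FFN modules of Conformer-small hold almost exactly as many parameters as the single full-width FFN of the Transformer block, so the remaining difference comes mainly from the CONV module.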

SPEECH SEPARATION EXPERIMENTS

For the SS task, we compare our Conformer model with Transformer and bidirectional long short-term memory (BLSTM) models on the WSJ0-2mix corpus. All models are trained with uPIT based on phase-sensitive masks (PSM) and use the ReLU activation function. The input features are 129-dimensional short-time Fourier transform (STFT) magnitude spectra computed with a sampling frequency of 8 kHz, a frame size of 32 ms, and a frame shift of 16 ms. The BLSTM-uPIT model has 3 BLSTM layers ( $d=896$ ), while the Transformer-uPIT and Conformer-uPIT models consist of 3 blocks ( $d^{\text{ff}}=896,d^{\text{att}}=1024,H=8$ ).
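The feature dimension follows directly from the framing parameters, as the short sketch below shows. It assumes that the FFT size equals the frame length and that no padding is applied when counting frames; the function names are hypothetical.

```python
def stft_dims(sample_rate_hz, frame_ms, shift_ms):
    """Return (frame length in samples, hop in samples, frequency bins)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    hop = sample_rate_hz * shift_ms // 1000
    n_bins = frame_len // 2 + 1  # one-sided spectrum of a real signal
    return frame_len, hop, n_bins


def num_frames(num_samples, frame_len, hop):
    """Number of full frames without padding."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop
```

At 8 kHz, a 32 ms frame spans 256 samples and a 16 ms shift spans 128 samples, giving 256 / 2 + 1 = 129 frequency bins.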

Table 6 summarizes the signal-to-distortion ratio (SDR) results of the different models on the WSJ0-2mix sets, the current benchmark for monaural speech separation. The results show that our Conformer-uPIT model is competitive with the BLSTM-uPIT model and achieves a significant improvement over the Transformer-uPIT model.

TTS EXPERIMENTS

Table 7 shows the mel-cepstral distortion (MCD), calculated with 0-34 order mel-cepstra and dynamic time warping (DTW) to match the lengths of the ground truth and the prediction. The results demonstrate that the Conformer-based models bring consistent improvements on all corpora, achieving the best performance among the compared models.
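A minimal sketch of such an evaluation is given below, using the common MCD definition in decibels and a basic DTW alignment. The function names are hypothetical, and using the frame-level MCD as the DTW cost and averaging over the alignment path are assumptions; evaluation scripts differ on these details.

```python
import math

# Constant of the usual MCD definition: (10 / ln 10) * sqrt(2).
MCD_CONST = 10.0 / math.log(10.0) * math.sqrt(2.0)


def frame_mcd(ref, pred):
    """MCD in dB between two mel-cepstral vectors of equal order."""
    return MCD_CONST * math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)))


def dtw_path(ref_seq, pred_seq):
    """Align two sequences with DTW; return a list of (i, j) index pairs."""
    n, m = len(ref_seq), len(pred_seq)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_mcd(ref_seq[i - 1], pred_seq[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Backtrack from the end along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def mcd(ref_seq, pred_seq):
    """Average frame-level MCD along the DTW alignment path."""
    path = dtw_path(ref_seq, pred_seq)
    return sum(frame_mcd(ref_seq[i], pred_seq[j]) for i, j in path) / len(path)
```

Lower values indicate that the predicted mel-cepstra are closer to the ground truth.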

CONCLUSION

We conducted comparative studies of the Conformer model in various speech applications with a large number of publicly available corpora. Specifically, the experiments were conducted on 25 ASR corpora (17 common sets + 8 low-resource sets), 1 ST corpus, 1 SS corpus, and 3 TTS corpora. In these experiments, our Conformer-based models achieved significant improvements in many ASR, ST, and TTS tasks and competitive results in SS tasks. We believe that the benchmark results, reproducible recipes, well-trained models, and training tips described in this paper will accelerate Conformer research on speech applications. Our aim for this activity is to bridge the gap between the high-resource research environments of big players and those of academia or small-scale research groups by providing these up-to-date research environments.