DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Shaoshi Ling, Yuzong Liu

Introduction

In the long history of semi-supervised learning (SSL) in speech recognition, self-training and knowledge distillation (also known as teacher-student training) have been the two most commonly used methods. The recent success of representation learning enables a new approach to leveraging unlabeled data. In the natural language processing community, BERT, ELMo, XLNet, GPT and its follow-ups are classical examples of representation learning. The key idea of representation learning is self-supervised learning: we obtain ‘free’ labels from unlabeled data and train the model in a supervised manner on proxy tasks. In BERT, two proxy tasks are defined: a masked language model task and a two-sequence prediction task. These proxy tasks are designed to force the model to learn a robust, meaningful representation. After the representation has been learned, a downstream task model is trained on labeled data using the learned representation. Optionally, the representation learning block and the downstream task block can be fine-tuned together.

Learning efficient speech representations can be traced back to the restricted Boltzmann machine, which allows pre-training on large amounts of unlabeled data before training deep neural network speech models. More recently, speech representation learning has drawn increasing attention in the speech processing community and has shown promising results in semi-supervised speech recognition. The proxy tasks used to learn speech representations fall into two types. The first type is based on contrastive loss and has been applied in wav2vec and its variants. The model is trained to learn representations that best discriminate the future or masked frame from a set of negative samples. The second type is based on reconstructive loss. Here the proxy task is to reconstruct temporal slices of acoustic features from contextual information. The reconstruction can be autoregressive or mask-based. APC and its follow-up are examples that use an autoregressive reconstruction loss. Many state-of-the-art pretrained language models, such as BERT and XLNet, adopt mask-based prediction as the proxy task. In speech, instead of prediction, we randomly mask temporal slices of acoustic features and attempt to reconstruct them.

Orthogonal to contrastive- and reconstructive-loss based speech representation learning, vector-quantized speech representations have been proposed. One motivation for applying vector quantization (VQ) is that enforcing quantization can lead to better linguistic unit discovery, owing to the discrete nature of phonetic units. In VQ-APC, the authors use VQ to limit model capacity and control the information needed to encode the representation. In VQ-wav2vec and wav2vec 2.0, the authors use VQ to facilitate direct application of BERT and other NLP algorithms.

In this paper, we introduce DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We take inspiration from many recent advances in speech representation learning and propose multiple improvements over vanilla DeCoAR. We summarize the contributions of this paper as follows:

We propose to use a Transformer as the encoding block, replacing the LSTM in the vanilla DeCoAR;

We present a deep contextualized acoustic representation learning approach with the addition of a vector quantization layer;

We propose a new objective function that combines a mask-based reconstruction loss with a VQ diversity loss.

Related Work

DeCoAR stands for deep contextualized acoustic representations, and was proposed in our previous work. As depicted in Figure 1, DeCoAR consists of two modules, an encoder module and a reconstruction module. For an input speech sequence $\mathbf{X}=(\mathbf{x}_{1},\cdots,\mathbf{x}_{T})$, the encoder module consists of stacked forward and backward LSTMs, and computes hidden representations that encode information from both previous and future frames (i.e. $\overrightarrow{\mathbf{z}}_{t},\overleftarrow{\mathbf{z}}_{t}$). For each temporal slice $(\mathbf{x}_{t},\mathbf{x}_{t+1},...,\mathbf{x}_{t+K})$, the reconstruction module takes the concatenated forward state at time $t$ and backward state at time $t+K$ as input, and uses position-dependent feed-forward networks to reconstruct each frame. Formally, the DeCoAR objective is defined as follows:
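
A per-slice objective consistent with the definitions in this section can be written as below. The use of an $\ell_{1}$ reconstruction error is an assumption; the text here fixes only the inputs to $\text{FFN}_{i}$ and the slice indices.

$$\mathcal{L}_{t}=\sum_{i=0}^{K}\left|\mathbf{x}_{t+i}-\text{FFN}_{i}\left([\overrightarrow{\mathbf{z}}_{t};\overleftarrow{\mathbf{z}}_{t+K}]\right)\right|$$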

where $\text{FFN}_{i}$ is a position-dependent feed-forward network that reconstructs the $i$-th frame in the slice. The final loss $\mathcal{L}$ is calculated over all possible slices in the entire sequence in an autoregressive manner, defined as: $\mathcal{L}=\sum_{t=1}^{T-K}\mathcal{L}_{t}$.

Vector-quantized Representation Learning

Wav2vec 2.0 is one of the successful examples of representation learning. It uses 10 minutes of labeled data with 53k hours of unlabeled data to achieve a word error rate (WER) of 5.2%/8.6% on the LibriSpeech benchmark. The model relies on a diverse codebook, learned via contrastive loss, that relates the underlying speech units to representations. Discretizing the continuous representation enables the application of many state-of-the-art NLP algorithms. In wav2vec 2.0, after the VQ operation is applied, the model is trained using a masked-LM-style loss, similar to BERT.

One potential challenge in learning optimal codebooks with contrastive loss is posed by data with nuisance factors such as noise and other adverse conditions. In these cases, the objective can be trivially optimized by assigning acoustic conditions (e.g. voice activity, noise) to codebook entries. A potential work-around is to use frame reconstruction as the objective, so that the network can leverage all available information in the input features to guide the learning of a robust representation.

VQ-APC

VQ-APC introduced a novel approach that inserts a VQ layer before frame prediction. The motivation for using VQ is to quantify the information needed to encode the speech representation and to control the capacity of the model. The model uses autoregressive predictive coding (APC) as its objective, instead of contrastive predictive coding (CPC). Their experiments showed that the APC/reconstruction objective performed better than the CPC/contrastive objective under the same conditions. They also demonstrated that the learned VQ codes correlate highly with the phoneme path, suggesting that VQ can capture linguistic units implicitly.

Proposed Framework

DeCoAR 2.0 is a follow-up to DeCoAR that draws on recent advances in natural language and speech representation learning. The left panel of Figure 2 illustrates the proposed DeCoAR 2.0 architecture. The model consists of three modules. The first module is the encoder network, which maps input masked acoustic frames $\mathbf{x}_{t+1},\cdots,\mathbf{x}_{t+K}$ into a latent representation $\mathbf{z}_{t+1},\cdots,\mathbf{z}_{t+K}$ via multiple Transformer blocks. The second module is the vector quantization network, which maps the latent representation $\mathbf{z}_{t+1},\cdots,\mathbf{z}_{t+K}$ to a new quantized representation $\mathbf{v}_{t+1},\cdots,\mathbf{v}_{t+K}$. The last module, the reconstruction network, passes the quantized representation through a feed-forward network and reconstructs the original input frames as $\mathbf{y}_{t+1},\cdots,\mathbf{y}_{t+K}$. We describe the design of each module and its training criterion in the following sections.

We replace the forward/backward LSTMs with a Transformer, due to its superiority in modeling long context. While an RNN/LSTM can model long context in theory, the Transformer achieves better performance thanks to its multi-head attention mechanism, which captures the relationship between any pair of samples in a long input sequence. In our encoder, we use a 1D convolutional layer with a kernel size of 256 and 16 groups. This performs an implicit relative positional encoding, as pointed out in prior work. The convolution is followed by a Gaussian Error Linear Unit (GELU) and layer normalization. The output is then fed into the deep Transformer encoder network, which produces a sequence of hidden vectors $\mathbf{Z}=(\mathbf{z}_{1},\cdots,\mathbf{z}_{T})$.

In our masking strategy, we mask a proportion of the feature frames and replace them with a trainable feature vector. We randomly sample starting indices and mask the subsequent $K$ consecutive time steps from every sampled index; spans do not overlap, and around 40% of all frames are masked in total.
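
The span sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_span_mask`, the greedy rejection of overlapping starts, and the rounding of the span count are all assumptions. Replacing the masked frames with the trainable vector is not shown.

```python
import random


def sample_span_mask(num_frames, span, mask_ratio=0.4, seed=None):
    """Return a boolean list marking masked frames.

    Hypothetical helper: places non-overlapping spans of `span` frames
    until about `mask_ratio` of the frames are covered.
    """
    rng = random.Random(seed)
    # Number of spans needed to cover roughly mask_ratio of the sequence.
    target_spans = round(mask_ratio * num_frames / span)
    # Visit every admissible start index in random order.
    starts = list(range(num_frames - span + 1))
    rng.shuffle(starts)
    mask = [False] * num_frames
    placed = 0
    for start in starts:
        if placed == target_spans:
            break
        # Reject a start whose span would overlap an existing span.
        if not any(mask[start:start + span]):
            for i in range(start, start + span):
                mask[i] = True
            placed += 1
    return mask


mask = sample_span_mask(num_frames=1000, span=20, mask_ratio=0.4, seed=0)
print(sum(mask) / len(mask))  # fraction of masked frames
```

Spans may touch each other under this sketch, so a masked run can be longer than $K$ frames even though no two spans overlap.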

Quantization Module

We introduce a quantization module in the DeCoAR 2.0 framework. The quantization module takes the latent representation $\mathbf{z}_{t}$ from the encoder module and maps it to a new representation $\mathbf{v}_{t}$. This is done by selecting one entry from a fixed-size codebook $C=\{c_{1},\cdots,c_{V}\}$, where $V$ is the size of the codebook, and applying a linear transformation to obtain $\mathbf{v}_{t}$. Selecting an entry from a discrete codebook is not differentiable. To mitigate this problem, we use Gumbel-Softmax with the reparameterization trick. In line with VQ-wav2vec, wav2vec 2.0 and VQ-APC, we use the straight-through Gumbel-Softmax estimator.

In our quantization module, we use multiple codebooks to obtain quantized representations. Formally, given the latent representation $\mathbf{z}$ from the encoder module and a set of codebooks $C_{1},\cdots,C_{G}$, where $G$ is the number of codebooks and each codebook has $V$ entries, we select one entry from each codebook, stack the resulting vectors, and apply a linear transformation to obtain the new representation $\mathbf{v}$. To learn which entry to select, we map the encoder output $\mathbf{z}$ to logits $\mathbf{l}\in\mathbb{R}^{G\times V}$ via a linear layer, and the probability of selecting the $j$-th code in the $g$-th codebook is defined as follows:
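
Assuming the standard Gumbel-Softmax form, which matches the symbols $\tau$, $n$ and $\mathbf{l}$ used in this section, the selection probability can be written as:

$$p_{g,j}=\frac{\exp\left((l_{g,j}+n_{j})/\tau\right)}{\sum_{k=1}^{V}\exp\left((l_{g,k}+n_{k})/\tau\right)}$$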

where $\tau>0$ is the softmax temperature, $n=-\ln(-\ln(u))$, and $u$ is sampled uniformly from $\mathcal{U}(0,1)$. At inference, the index with the largest value in the logits $\mathbf{l}$ is selected from each codebook.
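
The sampling and inference rules can be illustrated with a small sketch in plain Python. It assumes the standard Gumbel-Softmax form; the function names are hypothetical, and the straight-through gradient estimator and the codebook lookup are not shown.

```python
import math
import random


def gumbel_softmax_probs(logits, tau, rng):
    """Training-time selection probabilities for each codebook.

    `logits` is a list of G lists with V floats each (hypothetical layout).
    """
    probs = []
    for group in logits:
        noisy = []
        for logit in group:
            u = rng.random()
            while u == 0.0:  # avoid log(0); random() may return 0.0
                u = rng.random()
            n = -math.log(-math.log(u))  # Gumbel(0, 1) noise
            noisy.append((logit + n) / tau)
        # Softmax with max subtraction for numerical stability.
        peak = max(noisy)
        exps = [math.exp(x - peak) for x in noisy]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def select_codes_inference(logits):
    """Inference-time rule: index of the largest logit in each codebook."""
    return [max(range(len(group)), key=group.__getitem__) for group in logits]


logits = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]  # G=2 codebooks, V=3 entries
probs = gumbel_softmax_probs(logits, tau=2.0, rng=random.Random(0))
codes = select_codes_inference(logits)
```

A lower temperature $\tau$ pushes each row of `probs` closer to a one-hot vector, which is why the temperature is annealed during training.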

Training Objective

Since vector quantization layers are known to significantly disrupt model training, we apply the diversity loss proposed in wav2vec 2.0 to encourage the equal use of all entries in each codebook. Diversity loss maximizes the entropy of the averaged softmax distribution over the entries for each codebook in each mini-batch. Formally, the diversity loss is defined as:
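
Following the form used in wav2vec 2.0, and writing $\bar{p}_{g,v}$ (a symbol introduced here for illustration) for the softmax probability of entry $v$ in codebook $g$ averaged over a mini-batch, the diversity loss can be written as:

$$\mathcal{L}_{d}=\frac{1}{GV}\sum_{g=1}^{G}\sum_{v=1}^{V}\bar{p}_{g,v}\log\bar{p}_{g,v}$$

Minimizing this quantity maximizes the entropy of each averaged distribution. A minimal sketch of the computation, with a hypothetical function name and input layout, is:

```python
import math


def diversity_loss(batch_probs):
    """Negative entropy of the batch-averaged code distribution.

    `batch_probs[b][g][v]` is the softmax probability of entry v in
    codebook g for item b of the mini-batch (hypothetical layout).
    """
    batch = len(batch_probs)
    num_groups = len(batch_probs[0])
    num_entries = len(batch_probs[0][0])
    total = 0.0
    for g in range(num_groups):
        for v in range(num_entries):
            # Average the probability of this entry over the mini-batch.
            p_bar = sum(item[g][v] for item in batch_probs) / batch
            if p_bar > 0.0:  # treat 0 * log(0) as 0
                total += p_bar * math.log(p_bar)
    return total / (num_groups * num_entries)


uniform = [[[0.25, 0.25, 0.25, 0.25]]] * 8    # every entry used equally
collapsed = [[[1.0, 0.0, 0.0, 0.0]]] * 8      # a single entry always chosen
print(diversity_loss(uniform) < diversity_loss(collapsed))
```

The loss is lowest when all entries of a codebook are used equally and rises to zero when the codebook collapses to a single entry.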

Our final training objective is a combination of the two loss functions, weighted by a hyperparameter $\alpha$ :
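
Writing $\mathcal{L}_{r}$ for the masked reconstruction loss and $\mathcal{L}_{d}$ for the diversity loss (both symbols are assumptions introduced here), a weighted combination consistent with this description is:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{r}+\alpha\mathcal{L}_{d}$$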

Semi-supervised Speech Recognition with DeCoAR 2.0

After we have pre-trained the DeCoAR 2.0 model on unlabeled data, we freeze all the parameters in the network and remove the quantization and reconstruction modules. The representations from the Transformer encoder module are then fed to a downstream ASR system. This ASR system can be either a conventional acoustic model in a hybrid ASR system, or an end-to-end speech recognition model such as an RNN-Transducer or an encoder-decoder model. Note that in our framework, we only train the parameters of the downstream ASR model and leave all parameters in the encoder module fixed (i.e. no gradient is backpropagated into any layer of the encoder module).

Experimental Setup and Results

Our experiments were conducted on the publicly available LibriSpeech dataset. To simulate different SSL scenarios, we varied the labeled data size from 1-hour, 10-hour, up to 100-hour. The 100-hr dataset is based on train-clean-100 split, and the 1-hr/10-hr subsets are randomly selected from it.

To train the DeCoAR 2.0 model, we used the entire 960 hours of the LibriSpeech dataset as unlabeled data. We followed the conventional frontend feature extraction and used 80-dimensional log-mel filterbank features, extracted with a 25 ms sliding window at a 10 ms frame shift. The features were normalized via mean subtraction and variance normalization on a per-speaker basis.

For the encoder network in DeCoAR 2.0, we used 12 Transformer blocks, each consisting of a multi-head self-attention sublayer followed by a feed-forward sublayer. For a fair comparison, we set the model dimension to 768 and the inner dimension of the feed-forward sublayer to 3072, with 8 attention heads, as in the wav2vec 2.0 base model. The slice size $K$ was set to 20. We optimized the network with Adam, warming up the learning rate over the first 32000 updates to a peak of 0.0003 and then decaying it linearly. We grouped the input sequences by length with a batch size of 128 (truncating sequences to a maximum length of 15 seconds), and trained the models on 16 GPUs for 150 epochs. The Gumbel-Softmax temperature $\tau$ was annealed from 2 to a minimum of 0.5 by a factor of $0.999995$ at every update. We used weight $\alpha=0.1$ for the diversity loss and set $G=2$ and $V=320$ for the quantization module.
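
The two schedules in this setup can be sketched as simple functions of the update step. The function names are hypothetical, and two details are assumptions not stated above: the total number of updates (`total_steps`) and that the linear decay ends at zero.

```python
def gumbel_temperature(step, start=2.0, floor=0.5, decay=0.999995):
    """Temperature after `step` updates: multiplicative decay with a floor."""
    return max(floor, start * decay ** step)


def learning_rate(step, total_steps, warmup_steps=32000, peak=3e-4):
    """Linear warm-up to `peak`, then linear decay to zero at `total_steps`.

    `total_steps` and the zero end point are assumptions for illustration.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - warmup_steps)


for step in (0, 32000, 100000, 300000):
    print(step, learning_rate(step, total_steps=400000), gumbel_temperature(step))
```

With these values the temperature reaches its floor of 0.5 after roughly 277,000 updates, since $2\times 0.999995^{n}=0.5$ gives $n=\ln(0.25)/\ln(0.999995)$.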

Semi-supervised Speech Recognition Experimental Results

We trained acoustic models with CTC loss on labeled data as downstream tasks. Unlike conventional HMM-based hybrid ASR, training an acoustic model with CTC loss removes the need to prepare frame-wise alignments and other tedious steps such as building state-tying trees. The CTC label set consisted of 71 phonemes derived from the CMU lexicon, plus one blank symbol. For decoding, we used WFST-based decoding with EESEN. The CTC labels, the lexicon and a 4-gram language model for LibriSpeech were composed into a WFST-based decoding graph. We set the acoustic model scale to $1.0$ and the blank symbol prior scale to $0.3$. We used dev-clean for validation and test-clean and test-other for evaluation.

We trained different ASR systems for comparison, using different acoustic representations: wav2vec 2.0 features, VQ-APC features, our previously proposed DeCoAR features, and the DeCoAR 2.0 features proposed in this work. For wav2vec 2.0 features, we obtained 768-dimensional representations from the wav2vec 2.0 base model downloaded from https://github.com/pytorch/fairseq/tree/master/examples/wav2vec, which was pre-trained on 960-hour LibriSpeech data with contrastive loss and had exactly the same encoding network as ours. For VQ-APC features, we trained a VQ-APC model on 960-hour LibriSpeech using the official code provided by the authors at https://github.com/iamyuanchung/VQ-APC. We obtained 512-dimensional VQ-APC representations as input features. DeCoAR and DeCoAR 2.0 have dimensionality of 2048 and 768, respectively. For all systems trained on learned speech representations, the downstream ASR model is 2 layers of bidirectional LSTMs with CTC loss. In line with our previous work, we also trained purely supervised systems using conventional filterbank features. These models use 6 layers of bidirectional LSTMs with CTC loss. As a baseline, we also trained a purely supervised system on the entire 960-hour dataset using filterbank features.

Table 1 shows the results of the semi-supervised LibriSpeech experiments. We conducted our semi-supervised experiments using 1 hour, 10 hours, and 100 hours of labeled training data. Our proposed approach significantly outperforms the purely supervised filterbank baselines. In particular, under extremely data-sparse conditions, the proposed DeCoAR 2.0 method achieved highly competitive performance, with a WER of 5.43%/13.27% on test-clean/test-other using 10 hours of labeled data, and a WER of 13.75%/29.13% on test-clean/test-other using only 1 hour of labeled data. One notable observation is that the system using 10 hours of labeled data already outperforms the system trained on the full 960-hour data with filterbank features, by 6.7%/8.5% relative WER on test-clean/test-other.

Among the different speech representations, wav2vec 2.0 and DeCoAR 2.0 performed favorably compared to VQ-APC and DeCoAR. DeCoAR 2.0 is also comparable to wav2vec 2.0 in all SSL conditions. It is worth noting that we did not fine-tune any of the representation learning layers, as these models were trained in different stacks. Our interest is in comparing performance when the speech representations produced by different pre-trained models are used directly.

We conduct an ablation study on the effect of inserting the VQ layer in DeCoAR 2.0 (Table 2), and confirm that the VQ module is beneficial for ASR tasks. We hypothesize that vector quantization reduces the capacity of the DeCoAR model, forcing it to focus more on informative factors such as linguistic/phonetic unit discovery and less on other factors such as speaker traits and acoustic conditions.

Conclusion

In this paper, we present the vector-quantized Deep Contextualized Acoustic Representation (DeCoAR 2.0), an improved speech representation learning approach based on DeCoAR and vector quantization. DeCoAR 2.0 makes multiple modifications to its predecessor: it uses a deep Transformer as the encoding block and adds a vector quantization module before the reconstruction module. In extremely data-limited semi-supervised conditions, we observe that using 10 hours of labeled data with DeCoAR 2.0 achieves performance on par with the system trained on 960 hours of data with conventional filterbank features. DeCoAR 2.0 also performed comparably to wav2vec 2.0 in all semi-supervised scenarios. Future work includes exploring the efficacy of representation learning on real-world data, including noisy and adverse conditions, and extending the downstream tasks to neural transducers and other end-to-end ASR systems.