Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

eess.AS cs.CL cs.SD

Introduction

Recent works have demonstrated an interest in unsupervised representation learning as a pretraining method to obtain good speech features for downstream tasks with little labelled data. While Contrastive Predictive Coding (CPC) and its derivatives appear to be versatile methods for unsupervised representation learning, they do not yet reach state-of-the-art (SOTA) results on purely unsupervised learning metrics.

Data augmentation is useful for supervised training, and is also a key component of unsupervised setups in the image domain. It is less well established in unsupervised learning for speech, where the sequential nature of the signal may call for specific treatment.

Our first objective is to explore several types of time-domain data augmentation (additive noise, masking, reverberation) and several methods for augmenting in the contrastive framework (in the past, future, or both) in English (LibriSpeech). In a second stage, we extend the results to other languages (French and Mandarin) in the zero-resource 2017 benchmark. Lastly, we show that data augmentation benefits semi-supervised training, using the Libri-light benchmark.

Related work

CPC. Van den Oord et al. introduced Contrastive Predictive Coding, a method for unsupervised representation learning. Applied to speech, CPC trains a convolutional encoder and a predictor for future embeddings of the encoder. To prevent mode collapse, the loss is contrastive: an embedding should be close to positive future embeddings and distant from negative future embeddings. CPC has been used as pretraining for ASR and speaker identification. Non-contrastive versions of predictive coding with fixed embeddings can learn generic multi-task representations. Here we use a deeper and optimized version of a previously published CPC implementation.

Data augmentation for ASR. Basic time-domain augmentations modify the sampling rate of the input by a small factor ($\pm 10\%$), which changes both the duration and the pitch. Another one consists of adding noise, convolved with a room impulse response function to simulate point sources spread in space. SpecAugment is a spectral-domain augmentation whose effect is to mask bands of frequency and/or time. We introduce WavAugment, which implements these augmentations in the time domain and is optimized for applying augmentations on-the-fly as part of data loading.

The closest prior work applies data augmentation techniques to representation learning with autoencoders. However, it evaluated them in terms of pretraining for a downstream task, not in terms of the learned representation itself.

Method

Our method is based on a state-of-the-art CPC architecture. We explore how to perform data augmentation and introduce the WavAugment package.

The architecture is summarized in Figure 1. A convolutional encoder network produces a representation $z_{t}$ of the raw audio waveform. The sequence $(z_{t})$ is then passed to a recurrent context network to build our final representation $c_{t}$. At each step, we apply $c_{t}$ to a predictor neural network $Pred$ with several outputs $Pred^{k}$, each one reconstructing a future representation $z_{t+k}$ ($0<k\leq K$, $K=12$). The loss is contrastive: it tries to maximize the dot product between the predicted and the correct future representation while minimizing the dot product with a sample of 128 negative examples $\mathcal{N}_{t,k}$ taken from the batch. This gives the following loss:
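The equation itself did not survive in this copy. A reconstruction from the symbols defined in the paragraph above, in the usual softmax (InfoNCE-style) form of the CPC objective, might read as follows. The averaging over the $K$ prediction steps and the explicit positive term in the denominator are assumptions, not details confirmed by the text.

```latex
\mathcal{L}_{t} = -\frac{1}{K}\sum_{k=1}^{K}
  \log\frac{\exp\left(Pred^{k}(c_{t})^{\top} z_{t+k}\right)}
           {\exp\left(Pred^{k}(c_{t})^{\top} z_{t+k}\right)
            + \sum_{n\in\mathcal{N}_{t,k}}\exp\left(Pred^{k}(c_{t})^{\top} n\right)}
```

Minimizing this quantity raises the dot product with the correct future representation $z_{t+k}$ relative to the dot products with the negatives in $\mathcal{N}_{t,k}$.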

CPC2 is a modified version of the CPC architecture on which we build. The encoder architecture is unchanged (5 convolutional layers with kernel sizes , strides and hidden dimension 256). We increase the depth of the auto-regressive network, which improves accuracy (see Supplementary Table S1). For the recurrent context network, we use a 2-layer LSTM, as a tradeoff between feature quality and training speed. In the prediction network, we replace the $K$ independent transformers of the original architecture, each one predicting a specific time-step ahead, with a single multi-head transformer layer with $K$ classifiers at its heads. This has a limited impact on accuracy but dramatically decreases training time.

3.2 Data augmentation and CPC

As discussed in Section 3.1, the encoded representations $z_{t}$ are used in two ways: (a) to calculate the contextual representation $c_{t}$, and (b) as target predictions (positive or negative candidates). We refer to the representation $z_{t}$ as past and to the targets $z^{+},z^{-}$ as future. The model predicts future representations based on its past, by learning to discriminate the correct future from a sample of negative candidates (Figure 1).
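To make the discrimination step concrete, here is a minimal NumPy sketch of the contrastive score for a single time step and a single prediction offset. It is an illustration under assumptions, not the authors' implementation: the function name `contrastive_step_loss` is hypothetical, and the positive candidate is placed alongside the negatives in a softmax, as in the usual InfoNCE formulation.

```python
import numpy as np

def contrastive_step_loss(prediction, positive, negatives):
    """Softmax contrastive loss for one time step and one offset k.

    prediction: (D,)   predictor output computed from the past (c_t)
    positive:   (D,)   true future embedding z_{t+k}
    negatives:  (N, D) negative candidate embeddings
    """
    # Row 0 holds the positive candidate, the remaining rows the negatives.
    candidates = np.vstack([positive[None, :], negatives])
    logits = candidates @ prediction          # dot-product scores
    logits = logits - logits.max()            # shift for numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                    # negative log-probability of the positive

rng = np.random.default_rng(0)
dim, n_neg = 256, 128                         # sizes quoted in the text
positive = rng.normal(size=dim)
negatives = rng.normal(size=(n_neg, dim))

# An uninformative (all-zero) prediction scores every candidate equally.
chance_loss = contrastive_step_loss(np.zeros(dim), positive, negatives)
# A prediction aligned with the positive drives the loss towards zero.
aligned_loss = contrastive_step_loss(positive, positive, negatives)
```

With an uninformative prediction the loss equals the logarithm of the number of candidates, which gives a convenient chance-level reference when monitoring training.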

We can apply two different augmentations on the same speech sequence and use them to calculate past and future representations. This separation opens a plethora of possibilities for data augmentation: applying the same augmentation on all sequences in the batch (on query sequence and all positive and negative candidates); augmenting each sequence independently (past and future have identical augmentations, but negatives have independent augmentations); augmenting only past; augmenting only future; augmenting both past and future independently (past+future setting). Preliminary experiments demonstrated that the most promising approaches are either augmenting only the past representation or applying two independent augmentations on both past and future (past+future). In this work, we therefore focus on these two options.
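The two retained settings can be sketched as follows. This is only an illustration: `make_views` and `random_gain` are hypothetical names, and the random gain is a stand-in for a real augmentation such as pitch shifting or reverberation.

```python
import numpy as np

def make_views(waveform, augment, mode, rng):
    """Return (past_input, future_input) waveforms for one sequence.

    mode == "past":        only the input feeding the context c_t is augmented
    mode == "past+future": both inputs are augmented with independent draws
    """
    if mode == "past":
        return augment(waveform, rng), waveform
    if mode == "past+future":
        return augment(waveform, rng), augment(waveform, rng)
    raise ValueError(f"unknown mode: {mode}")

def random_gain(waveform, rng):
    # Placeholder augmentation (hypothetical): scale by a random gain.
    return waveform * rng.uniform(0.5, 1.5)

rng = np.random.default_rng(0)
wave = rng.normal(size=16000)                 # roughly 1 s of audio at 16 kHz
past_only = make_views(wave, random_gain, "past", rng)
both = make_views(wave, random_gain, "past+future", rng)
```

In the past setting the targets come from the clean signal, so the encoder only runs on augmented audio once per sequence for the context branch; in the past+future setting each branch receives its own random draw.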

3.3 WavAugment

Our experimental setup requires applying independent data augmentations to short audio sequences ($\approx 1$ s). For this, we developed WavAugment, a library that implements time-domain augmentations. WavAugment is publicly available at github.com/facebookresearch/WavAugment. WavAugment builds upon a C++ API to libsox (http://sox.sourceforge.net/sox.html), which implements dozens of audio processing transformations. WavAugment has a PyTorch interface, and PyTorch- and libsox-based effects can be interleaved transparently.

3.4 Datasets and evaluation measures

In the experiments reported below, unless stated otherwise, we trained our CPC model on LibriSpeech-100h, which is a set of short sentences of good-quality (clean) read speech from a balanced set of speakers. We used all of the files directly, without filtering or modification. In Experiment 2, we introduce two similar datasets in French and Mandarin, respectively. The French dataset was created by selecting the French data from the Librivox website (https://librivox.org/), and the Mandarin dataset was taken from the MagicData dataset. The recordings were cut into “utterance”-like segments using pyannote’s Voice Activity Detector. Both datasets had a number of speakers and a total duration similar to those of LibriSpeech-100h (250 speakers, 76h and 80h respectively).

We tested the learned representation using the Libri-light ABX metric for unsupervised representation learning. This distance-based metric estimates the probability that a speech segment $X$ is closer to a segment $A$ with the same transcription than to a segment $B$ with a different transcription. The distance is the DTW-realigned average angle (arc-cosine of the normalized dot product) between frames. The test uses minimal pairs of triphones that differ only in the central phoneme (‘bet’ vs ‘bit’), and is conducted within-speaker ($A$, $B$ and $X$ are from the same speaker) and across-speaker ($A$ and $B$ are from one speaker, $X$ from another one). This metric has been shown to be useful for analysing the linguistic content of speech features without having to train a classifier, and has been used in the Zero Resource Challenge series.
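The distance underlying the ABX test can be sketched in a few lines of NumPy. This is a simplified illustration rather than the Libri-light evaluation code: the function names are hypothetical, and normalizing the accumulated angle by the length of the alignment path is an assumption about how the average is taken.

```python
import numpy as np

def angular_distance_matrix(x, y):
    """Pairwise angles (radians) between frames of x (Tx, D) and y (Ty, D)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.arccos(np.clip(xn @ yn.T, -1.0, 1.0))

def dtw_average_angle(x, y):
    """Average frame angle along the minimum-cost DTW alignment path."""
    d = angular_distance_matrix(x, y)
    tx, ty = d.shape
    cost = np.full((tx, ty), np.inf)          # accumulated angle
    steps = np.zeros((tx, ty), dtype=int)     # length of the best path so far
    cost[0, 0], steps[0, 0] = d[0, 0], 1
    for i in range(tx):
        for j in range(ty):
            if i == 0 and j == 0:
                continue
            previous = []
            if i > 0:
                previous.append((cost[i - 1, j], steps[i - 1, j]))
            if j > 0:
                previous.append((cost[i, j - 1], steps[i, j - 1]))
            if i > 0 and j > 0:
                previous.append((cost[i - 1, j - 1], steps[i - 1, j - 1]))
            best_cost, best_steps = min(previous)
            cost[i, j] = best_cost + d[i, j]
            steps[i, j] = best_steps + 1
    return cost[-1, -1] / steps[-1, -1]

def abx_correct(a, b, x):
    """True when X is closer to A (same category) than to B."""
    return dtw_average_angle(a, x) < dtw_average_angle(b, x)
```

The ABX error rate is then the fraction of (A, B, X) triplets for which `abx_correct` is false, aggregated over the minimal pairs.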

Experiments

We focus on 5 augmentations that were either proposed earlier or that can potentially inject useful invariances into the speech representations. We selected: pitch modification (pitch), additive noise (add), reverberation (reverb), band reject filtering (bandrej), and time masking (tdrop). The last two augmentations are similar to those used in SpecAugment. The pitch can be attributed to the source (how the speaker talks), add and reverb to the communication channel, and bandrej and tdrop to noise in the neural representation of the speech. When we compose augmentations (indicated by ’+’), they are applied in that order.

In pilot experiments, we calibrated the strength of the augmentations by looking at the overall ABX results (within and across, on the dev clean and dev other sets of Libri-light). For pitch, the applied change in the pitch is an integer sampled uniformly between -300 and +300 (the change value is measured by 1/100 of a tone). For reverb, we uniformly sample room-scale between 0 and 100, fixing other parameters to their defaults. tdrop zeroes out one random subsequence of length 50ms. We found that bandrej performs best when we set the maximal width of the rejected spectrum to $150$ Hz.
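As an illustration of these settings, the sketch below implements the time-masking effect directly in NumPy and draws the pitch and reverberation parameters in the quoted ranges. It does not use the WavAugment API; the function names are hypothetical and a 16 kHz sampling rate is assumed.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumption: 16 kHz audio, as in LibriSpeech

def time_dropout(waveform, rng, length_ms=50, sample_rate=SAMPLE_RATE):
    """Zero out one randomly placed span of `length_ms` milliseconds."""
    span = min(int(sample_rate * length_ms / 1000), len(waveform))
    start = int(rng.integers(0, len(waveform) - span + 1))
    out = waveform.copy()
    out[start:start + span] = 0.0
    return out

def sample_parameters(rng):
    """Draw augmentation strengths in the ranges quoted in the text."""
    return {
        "pitch_shift": int(rng.integers(-300, 301)),  # integer in [-300, 300]
        "room_scale": float(rng.uniform(0, 100)),     # reverb room-scale
    }

rng = np.random.default_rng(0)
wave = np.ones(SAMPLE_RATE)                   # 1 s dummy signal
masked = time_dropout(wave, rng)
params = sample_parameters(rng)
```

Drawing fresh parameters for every sequence is what makes the augmentations independent across the items of a batch.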

We discovered accidentally that, for additive noise, low frequencies are more effective than high frequencies. We therefore systematically explored the effect of the spectral characteristics of the noise by filtering sounds from the MUSAN dataset in successive frequency bands. We selected 5 broad bands, defined by 4 cutoff points obtained by successive tripling of the frequency (80Hz, 240Hz, 720Hz, 2160Hz). We found that the optimal additive noise was obtained by bandpass filtering MUSAN sounds in the $[80,240]$ Hz range, which corresponds roughly to the range of human F0 (see Supplementary Table S3).
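A band-limited additive-noise augmentation of this kind can be sketched as follows. The brick-wall FFT filter is a simplification chosen to keep the example dependency-free, the function names are hypothetical, and the signal-to-noise ratio is an illustrative value that is not taken from the text. The band limits of 80 and 240 Hz are two of the cutoff points listed above, and white noise stands in for a MUSAN recording.

```python
import numpy as np

def bandpass_fft(signal, low_hz, high_hz, sample_rate):
    """Brick-wall band-pass: zero all FFT bins outside [low_hz, high_hz]."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
sample_rate = 16000                            # assumed sampling rate
speech = rng.normal(size=sample_rate)          # stand-in for a speech segment
noise = rng.normal(size=sample_rate)           # stand-in for a noise recording
low_noise = bandpass_fft(noise, 80, 240, sample_rate)
noisy = add_noise(speech, low_noise, snr_db=10.0)  # SNR value is hypothetical
```

Filtering the noise before mixing, rather than the mixture, leaves the speech spectrum untouched and confines the perturbation to the selected band.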

4.2 Experiment 1: Data augmentation combinations

We first tested these five augmentations alone, either applying them to the past of the sequence or independently to past and future (past+future) (see Section 3.2).

Analyzing single augmentations in Table 1, we first observe that in many cases applying augmentations on past performs as well as, or even better than, past+future (pitch, reverb, bandrej). The only augmentation performing better on past+future is add. We did not experiment with tdrop applied on past+future, as this would zero out the predicted sequences. According to their average performance, the individual augmentations can be sorted, from most to least useful: pitch, add, reverb, tdrop, and bandrej.

Next, we study the performance of combinations of augmentations. We decided to drop bandrej from consideration due to its poor results. We only consider augmenting past, as this gives roughly the same quality of representations, but requires less computation. As a result, we have 6 possible two-way, 4 three-way, and 1 four-way combination of effects. The results are in the lower part of Table 1, and they show that pitch+add+reverb performs best in 3 out of 4 metrics.
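The count of combinations follows directly from choosing subsets of the four remaining effects, as this short check shows. `itertools.combinations` preserves the order of its input, so each name lists the effects in the order in which they would be applied.

```python
from itertools import combinations

effects = ["pitch", "add", "reverb", "tdrop"]  # bandrej has been dropped

combos = {
    size: ["+".join(subset) for subset in combinations(effects, size)]
    for size in (2, 3, 4)
}
counts = {size: len(names) for size, names in combos.items()}
```

Here `counts` holds 6 two-way, 4 three-way and 1 four-way combination, and `combos[3]` contains the best-performing `pitch+add+reverb`.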

We chose this combination and evaluated the corresponding model on the Libri-light test set. The results are reported in Table 2 and show that, across all metrics, data augmentation yields relative improvements of 18-22% over no augmentation, and ends up with better results than the original CPC algorithm run on the much larger 60k hours dataset.

4.3 Experiment 2: Extending to other languages

In this experiment, we tested whether our data augmentation technique could be extended to other languages. We selected the three dev datasets of the ZeroSpeech Challenge 2017, covering English, French, and Mandarin. As in the previous experiment, the metrics are the within- and across-speaker ABX tests provided by the Challenge. For training, we used both the small in-domain training sets provided by the Challenge (45h, 24h, and 2h30, respectively), and our own, larger, out-of-domain training sets. For English, we used LibriSpeech-100 (100h); for French, the 76h of French-librivox; and for Mandarin, the 80h of MagicData described in Section 3.4. We also observed, training on LibriSpeech-100 and testing on Libri-light dev, that using larger datasets in combination with data augmentation made it possible to benefit from increasing the number of LSTM layers to 3 (see Supplementary). We included this modification in the experiments.

The results are shown in Table 3. As can be seen, while noise augmentation improves the score on all three languages, we cannot reach the SOTA with the small training datasets provided by the challenge. We can, however, be on par with or improve over the best performing baseline with our out-of-domain training sets (same languages, larger datasets), in particular with the larger model. This shows that while our technique scales with dataset size, it is still less data efficient than the techniques described in Heck et al. and Chorowski et al. Note, however, that both studies used speaker adaptation, which is outside the scope of what can be done with standard time-domain data augmentation techniques.

4.4 Experiment 3: Pretraining and limited supervision

In this experiment, we test whether our data augmentation technique can build better speech features that can be used for downstream tasks. Here, we use the Libri-light limited supervision phone classification task, which contains intentionally small training sets (10 min, 1h or 10 hours of labelled data). We fine-tune a linear phone classifier built on top of the CPC features with a CTC loss (frozen features). On 10 hours of data, we also fine-tune the entire network. Again, we additionally experiment with an architecture that has a 3-layer LSTM (CPC2-L3) (see Supplementary Table S2).

The results are in Table 4 and show an effect of signal-based data augmentation, both for pretraining and for fine-tuning. For the supervised fine-tuning phase, we found that we got the best results by using only pitch augmentation, with other methods having low or negative effects in this case. The combined effects of data augmentation on pretraining and fine-tuning add up to a 12-15% relative improvement across the different training sets. Interestingly, we find that with data augmentation we can beat the reference baseline (pretraining on 60k hours plus fine-tuning on 10 hours) on frozen features with substantially less data (pretraining on 100 hours, plus fine-tuning on 1 hour). Another point worth mentioning is that, with data augmentation, 10 minutes of data on frozen features is sufficient to outperform the no-pretraining reference with 10 hours of labels.

Discussion

We have introduced WavAugment, a library for time-domain data augmentation, and illustrated its use in the context of unsupervised contrastive representation learning and in the context of learning with limited supervision. We found that pitch and additive noise are the most powerful data augmentation techniques for our implementation of contrastive predictive coding, yielding very good results in unsupervised representation learning in English, Mandarin and French. We further showed that these gains extend to fine-tuning on very limited data, yielding gains in PER. Interestingly, the two most popular data augmentation techniques that are typically done in the spectral domain (as in SpecAugment) do not work very well for CPC training. Furthermore, pitch and additive noise are techniques that can only be applied in the time domain. Further work is needed to determine whether the superiority of time-domain noise augmentation over spectral ones is specific to the CPC loss or to the fact that our architecture starts directly from the waveform, as opposed to using spectral features like Mel filterbanks or MFCCs. Note that the autoencoder-based work discussed in the related work also combines several data augmentation techniques for unsupervised learning. Among the data augmentation techniques it uses most are two time-domain ones (reverberation and additive noise) and one spectral one (band reject). It remains to be seen how pitch would fare in such a pretraining setup.

Conclusion

With data augmentation, CPC can take good advantage of relatively small amounts (around 100 hours) of clean and well-segmented speech, although it is currently insufficient to learn competitively from very small amounts of data (between 2.5 and 50 hours). More research is needed to extend such techniques in both directions: to small amounts of data, and to very large and potentially noisier datasets. In addition, the differences that we observe between data augmentation effects raise the question of a more systematic exploration of data augmentation as a function of tasks and architectures.


S1 Supplementary Results

We started from a previously described CPC model: the encoder network is composed of 5 convolutional layers with kernel sizes , strides and hidden dimension 256. We used ReLU activations and inserted a channel normalization procedure between each convolutional layer. For the context network, we used a 2-layer LSTM. Finally, we used a single-layer multi-head transformer to do the prediction instead of several single-head transformers. Table S1 shows the ablations that we ran to compare these different versions.

We ran our experiments using the Adam optimizer with $lr=2e-4,\beta_{1}=0.9,\beta_{2}=0.999$. Although we did not use learning rate decay, we used a learning rate ramp for the first $10$ epochs.
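One way to picture this schedule is the short function below. The text does not specify the shape of the ramp, so the linear increase is an assumption, and `learning_rate` is a hypothetical helper rather than part of the training code.

```python
BASE_LR = 2e-4        # learning rate quoted in the text
WARMUP_EPOCHS = 10    # length of the ramp quoted in the text

def learning_rate(epoch, base_lr=BASE_LR, warmup_epochs=WARMUP_EPOCHS):
    """Linear ramp over the first `warmup_epochs` epochs, constant afterwards.

    `epoch` is zero-based. The linear shape is an assumption.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

schedule = [learning_rate(epoch) for epoch in range(15)]
```

Under this assumption the rate reaches its full value at the tenth epoch and stays there, consistent with the absence of decay.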

S1.2 Changing the architecture: dataset size in presence of data augmentation

In the next experiment, we study the performance of our model as a function of the size of the available data and of the architecture size (controlled by the number of LSTM layers). We simulate the amounts of data available at ZeroSpeech2017 for Mandarin (3h), French (45h), and English (100h) by sub-sampling from LibriLight (3h and 45h) and using LibriSpeech (100h). In all experiments, we use the best data augmentation found in the main text (pitch+add+reverb-past). We report the obtained results in Table S2.

We observe that for the 3h and 45h datasets, the architecture with 2 LSTM layers still performs best. However, with 100h of data, increasing the model depth turns out to be beneficial. Comparing with the results reported in Table S1, we see that it is the presence of data augmentation that allows us to leverage a deeper architecture.

S1.3 Frequency sensitive additive noise

Here, we explore how frequency filtering affects additive noise data augmentation. We ran two experiments: band-pass filtering and low-pass filtering. For band-pass filtering, we applied the following frequency bands to the MUSAN dataset: $[0,80]$ Hz, $[80,240]$ Hz, $[240,720]$ Hz, $[720,2160]$ Hz, and the band above $2160$ Hz. The second band corresponds roughly to the range of human pitch (F0), the third to the range of the first formant (F1), and the fourth to the range of the second formant (F2). The extreme ranges (very low or very high frequencies) do not typically carry much information. Table S3 shows the effect of filtering in these bands before adding the noise to the speech signal. An optimal range seems to be $[80,240]$ Hz. For low-pass filtering, we selected successive 100Hz bands, starting from zero.
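The band edges follow from the four cutoff points given in the main text (80, 240, 720 and 2160 Hz, each three times the previous one). A small sketch of that arithmetic, with hypothetical helper names, is given below; the upper limit of the fifth band is not stated in this copy, so it is left open.

```python
def tripling_cutoffs(first_hz=80, count=4):
    """Cutoff frequencies obtained by repeatedly tripling `first_hz`."""
    return [first_hz * 3 ** i for i in range(count)]

cutoffs = tripling_cutoffs()
edges = [0] + cutoffs
# Four bounded bands; the fifth band covers everything above the last cutoff.
bounded_bands = list(zip(edges[:-1], edges[1:]))
band_labels = {"F0": bounded_bands[1], "F1": bounded_bands[2], "F2": bounded_bands[3]}
```

The labels simply restate the correspondence given in the text between the second, third and fourth bands and the ranges of F0, F1 and F2.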