Deep convolutional acoustic word embeddings using word-pair side information

Herman Kamper, Weiran Wang, Karen Livescu

Introduction

Most current speech processing systems rely on a deep architecture to classify speech frames into subword units (often phone states). This approach still relies on frame-level independence assumptions as well as a pronunciation lexicon for breaking up words into their subword constituents. As an alternative, some researchers have started to reconsider using whole words as the basic modelling unit.

Some of the earliest speech recognition systems were based on template-based whole-word modelling . This idea has been revisited in modern template-based automatic speech recognition (ASR) systems , as well as modern speech indexing applications such as query-by-example search . These systems typically use dynamic time warping (DTW) to quantify the similarity of phone or word segments of variable length. Recent work has also considered frame-level embeddings which map acoustic features to a new frame-level representation that is tailored to word discrimination when combined with DTW . DTW, however, has known inadequacies and is quadratic-time in the duration of the segments.
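To make the quadratic cost concrete, here is a minimal sketch of a standard DTW alignment cost in Python with NumPy. It is not the implementation used by the systems discussed above: the Euclidean frame distance, the three-way step pattern, and the normalization by total length are illustrative assumptions.

```python
import numpy as np

def dtw_cost(a, b):
    """Length-normalized DTW alignment cost between sequences a (n x d) and b (m x d)."""
    n, m = len(a), len(b)
    # Frame-to-frame Euclidean distances: an n x m matrix (the quadratic part).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor cells.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev
    # Normalizing by n + m is an assumption; other conventions exist.
    return acc[n, m] / (n + m)
```

Both the distance matrix and the accumulated-cost table have one entry per frame pair, which is where the quadratic dependence on segment duration comes from.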

Levin et al. proposed a segmental approach where an arbitrary-length speech segment is embedded in a fixed-dimensional space such that segments of the same word type have similar embeddings. Segments can then be compared by simply calculating a distance in the embedding space, a linear time operation in the embedding dimensionality. Several approaches were developed in , and in these were successfully applied in a query-by-example search system.

Bengio and Heigold similarly used whole-word fixed-dimensional representations in a segmental ASR lattice rescoring system. Their acoustic embeddings are obtained from a convolutional neural network (CNN), trained with a combination of a word classification and a ranking loss. When combining the hypotheses of the baseline system with the embedding-based scores, ASR performance was improved. A similar approach was followed in , where long short-term memory (LSTM) networks were used to obtain whole-word embeddings for a query-by-example search task. Finally, Maas et al. trained a regression CNN that reconstructs a semantic word embedding from acoustic speech input; these features were used in a segmental conditional random field ASR system.

In this paper we compare several CNN-based approaches to each other and to the best approach of Levin et al. , on a word discrimination task. This task has been used in several other studies to assess the accuracy of acoustic embedding approaches without the need to train a complete recognition or search system. Building on ideas from earlier CNN-based approaches, we propose new networks that make use of weaker supervision in the form of known word pairs. The approach is based on Siamese networks: tied networks that take in pairs of input vectors and minimize or maximize a distance depending on whether a pair comes from the same or different classes . We show that a Siamese CNN trained with a hinge-like contrastive loss function outperforms the best approach of Levin et al. , and performs similarly to a word classifier CNN, despite the weaker form of supervision. By reducing the Siamese CNN embedding dimensionality with a post-processing linear discriminant analysis, we also obtain a more compact embedding that maintains best performance.

Acoustic word embedding approaches

Typical embedding approaches use a training set of known word segments $\mathcal{Y}_{\text{train}}=\{Y_{i}\}_{i=1}^{N_{\textrm{train}}}$ to learn the embedding function $f$, which maps a variable-length word segment to a fixed-dimensional vector. Different degrees of supervision can be assumed, ranging from unsupervised, where the only knowledge of $\mathcal{Y}_{\text{train}}$ is that it contains unidentified word segments, to supervised, where the word label for each segment is known. Below we review previous work (Sections 2.1 and 2.2) and then present our own approaches, which use weak supervision in the form of known word pairs (Section 2.3) and can additionally use word labels to find lower-dimensional but still accurate embeddings (Section 2.4).

2.2 Word classification CNN

Instead of using the representation from the word classification CNN directly, Bengio and Heigold used a paired network with a ranking loss to map acoustic word embeddings into a common space with orthography-based word embeddings obtained by also mapping the word labels $\mathcal{W}_{\text{train}}$ into a lower-dimensional space. This was done to make it possible to use the classifier outputs in a particular lattice rescoring architecture, which requires scores for lattice arcs. The evaluation framework we use (Section 3.1) is designed to be decoupled from a recognition architecture, and we can therefore use the distributional representation from the classifier CNN directly. An investigation of whether embeddings of word labels can be additionally used to improve acoustic word embeddings is left for future work.

2.3 Word similarity Siamese CNNs

If the labels $\mathcal{W}_{\textrm{train}}$ for the training set $\mathcal{Y}_{\textrm{train}}$ are not known, a weaker form of supervision that has also been used is the knowledge that pairs of word segments in $\mathcal{Y}_{\textrm{train}}$ share the same unknown word type: $\mathcal{S}_{\textrm{train}}=\{(m,n):(Y_{m},Y_{n})\textrm{ are of the same type}\}$ . This type of side information is appealing since it is often easier to obtain in low-resource settings, for example by using an unsupervised term discovery system to find unidentified matching word pairs.

Such paired supervision has been used for several problems and domains, including phonetic discovery , semantic word embeddings and vision applications . Many of these studies use Siamese networks, a term used since the early 1990s to describe a pair of networks with tied parameters which is trained to optimize a distance function between representations of two data instances . To train these networks it is sometimes assumed that pairs not in $\mathcal{S}_{\textrm{train}}$ belong to different types; we also make this assumption here.

Figure 1(b) illustrates how we apply this idea to obtain acoustic word embeddings. The two sides of our Siamese network take padded inputs $\boldsymbol{\mathbf{Y}}_{1}$ and $\boldsymbol{\mathbf{Y}}_{2}$ . For the two sides we use CNNs similar to that of the word classification CNN. But instead of terminating in a softmax layer, the final fully connected layer on each side gives the desired acoustic embedding. In initial experiments on development data, we considered several loss functions, and here we focus only on the most successful ones. We found that losses based on cosine similarity outperformed Euclidean-based losses, and in particular the $\text{cos cos}^{2}$ loss from gave the best performance of the losses in :

$$l_{\text{cos cos}^{2}}(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})=\begin{cases}\frac{1-\cos(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})}{2} & \text{if same}\\ \cos^{2}(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2}) & \text{if different}\end{cases}$$

This loss pushes the angle between embeddings of the same type to be zero, while embeddings of different types are pushed to be orthogonal.
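As an illustration of this behaviour, the sketch below assumes the loss takes the value $(1-\cos)/2$ for same-type pairs and $\cos^{2}$ for different-type pairs, which matches the description above. The function names `cos_sim` and `loss_cos_cos2` are hypothetical, and the code operates on single NumPy vectors rather than on network outputs in a training graph.

```python
import numpy as np

def cos_sim(x1, x2):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def loss_cos_cos2(x1, x2, same):
    """Assumed form of the cos cos^2 loss for one pair of embeddings."""
    c = cos_sim(x1, x2)
    if same:
        # Zero only when the angle between the embeddings is zero.
        return (1.0 - c) / 2.0
    # Zero only when the embeddings are orthogonal.
    return c ** 2
```

Under this form, a same-type pair pointing in opposite directions and a different-type pair pointing in the same direction both incur the maximum loss of 1.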

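In discrimination tasks, the decision of whether two data instances are of the same type is not based on their absolute distance, but rather their relative distance compared to other pairs. This motivates a margin-based (hinge) loss, similar to that of :

$$l_{\text{cos hinge}}=\max\left\{0,\,m+d_{\text{cos}}(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})-d_{\text{cos}}(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{3})\right\}$$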

where $d_{\text{cos}}(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})=\frac{1-\cos(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})}{2}$ is the cosine distance between $\boldsymbol{\mathbf{x}}_{1}$ and $\boldsymbol{\mathbf{x}}_{2}$ , and $m$ is a margin parameter. Here, $\boldsymbol{\mathbf{x}}_{1}$ and $\boldsymbol{\mathbf{x}}_{2}$ are always of the same type while $\boldsymbol{\mathbf{x}}_{1}$ and $\boldsymbol{\mathbf{x}}_{3}$ are of different types. This loss is therefore at a minimum when all embeddings $\boldsymbol{\mathbf{x}}_{1}$ and $\boldsymbol{\mathbf{x}}_{2}$ of the same type are more similar by a margin $m$ than embeddings $\boldsymbol{\mathbf{x}}_{1}$ and $\boldsymbol{\mathbf{x}}_{3}$ of different types. The margin also gives some leeway to the model.
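The margin behaviour can be sketched for a single triplet as follows. The code assumes the hinge takes the form $\max\{0, m+d_{\text{cos}}(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})-d_{\text{cos}}(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{3})\}$, consistent with the definitions above. The function names are hypothetical, and the default margin is the value reported in the experiments.

```python
import numpy as np

def d_cos(x1, x2):
    """Cosine distance (1 - cos) / 2, which lies in [0, 1]."""
    cos = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    return float((1.0 - cos) / 2.0)

def loss_cos_hinge(x1, x2, x3, m=0.15):
    """Assumed hinge loss: x1 and x2 are the same type, x1 and x3 differ."""
    # No penalty once the same-type pair is closer than the
    # different-type pair by at least the margin m.
    return max(0.0, m + d_cos(x1, x2) - d_cos(x1, x3))
```

For example, with `x1 = x2` and `x3` orthogonal to `x1`, the different-type distance is 0.5, which exceeds the margin, so the loss is zero and the triplet contributes no gradient.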

Although Siamese networks have been used widely, to our knowledge this is the first work which uses Siamese networks (in particular Siamese CNNs) to obtain acoustic word embeddings from speech.

2.4 Controlling embedding dimensionality

We aim to learn word embeddings that are both discriminative and compact (low-dimensional). The desired dimensionality may be guided by both computational and data constraints, and we may wish to be able to adjust it. For word classification networks (Section 2.2), the output dimensionality is given by the vocabulary size. In our experiments (next section), we explore adjusting the dimensionality by inserting an additional linear bottleneck layer before the final softmax, with the number of units corresponding to the desired final dimensionality. In Siamese networks (Section 2.3) the final dimensionality can be directly tuned. If we have access to word labels $\mathcal{W}_{\textrm{train}}$ in addition to word pairs $\mathcal{S}_{\textrm{train}}$ , we can also perform additional dimensionality reduction on the Siamese CNN outputs using a supervised technique; in our experiments we use linear discriminant analysis (LDA).
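The LDA post-processing step can be sketched as below. This is a plain Fisher LDA built from scatter matrices in NumPy. The text does not say which LDA implementation was used, so the function name `lda_projection` and the small ridge term added for numerical stability are assumptions made for illustration.

```python
import numpy as np

def lda_projection(X, labels, n_components):
    """Return a (d x n_components) LDA projection for embeddings X (n x d)."""
    labels = np.asarray(labels)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        S_w += centered.T @ centered
        diff = (class_mean - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # Hypothetical ridge term so that S_w is always invertible.
    S_w += 1e-6 * np.eye(d)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    # Keep the directions with the largest between/within scatter ratio.
    order = np.argsort(-eigvals.real)[:n_components]
    return eigvecs[:, order].real
```

A reduced embedding is then obtained as `X @ W`, where `W` is the returned projection estimated on labelled training embeddings.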

Experiments

Ultimately we would like to evaluate the different acoustic embedding approaches for downstream speech recognition and search tasks. However, we do not want to be tied to a specific recognition architecture, and we would like to quickly compare many embedding approaches. We therefore use a word discrimination task developed for this purpose ; in the same-different task, we are given a pair of acoustic segments, each corresponding to a word, and we must decide whether the segments are examples of the same or different words.

This task can be approached in a number of ways, but typically it is done either via a DTW score between segments (when using frame-by-frame embeddings), or via a Euclidean or cosine distance between vectors (when embedding complete segments). In our evaluation, after training a model on $\mathcal{Y}_{\textrm{train}}$ , the acoustic word embeddings of a disjoint test set $\mathcal{Y}_{\textrm{test}}$ are computed. For every word pair in this set, the cosine distance is calculated between their embeddings (we also tried Euclidean distance, but as in , cosine worked better). Two words can then be classified as being of the same or different type based on some threshold, and a precision-recall curve is obtained by varying the threshold. To evaluate embeddings across different operating points, the area under the precision-recall curve is calculated to yield the final evaluation metric, referred to as the average precision (AP).
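The evaluation procedure can be sketched as follows. This is a simplified illustration rather than the evaluation code used for the reported results. In particular, AP is computed here as the mean of the precision values at each same-word pair in the distance ranking, which is one common estimate of the area under the precision-recall curve. The function name `same_different_ap` is hypothetical.

```python
import numpy as np

def same_different_ap(embeddings, labels):
    """Average precision for the same-different task on an (n x d) embedding matrix."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = np.asarray(labels)
    # All unordered pairs (i, j) with i < j.
    i, j = np.triu_indices(len(labels), k=1)
    # 1 - cos gives the same ranking as the (1 - cos) / 2 distance.
    distances = 1.0 - np.sum(X[i] * X[j], axis=1)
    is_same = labels[i] == labels[j]
    # Sweeping a threshold upwards is equivalent to walking down this ranking.
    order = np.argsort(distances, kind="stable")
    hits = is_same[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision * hits) / np.sum(hits))
```

An AP of 1 means that every same-word pair is closer than every different-word pair.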

We use data from the Switchboard corpus of English conversational telephone speech. Data is parameterized as Mel-frequency cepstral coefficients (MFCCs) with first and second order derivatives, yielding 39-dimensional feature vectors. Cepstral mean and variance normalization (CMVN) is applied per conversation side. For the training set $\mathcal{Y}_{\textrm{train}}$ we use the set of about 10k word tokens from ; it consists of word segments of at least 5 characters and 0.5 seconds in duration extracted from a forced alignment of the transcriptions, and comprises about 105 minutes of speech. For the Siamese CNNs, this set results in about 100k word segment pairs for $\mathcal{S}_{\textrm{train}}$ . For testing, we use the 11k-token set $\mathcal{Y}_{\textrm{test}}$ from , making the results from these studies directly comparable to the results obtained here (in , a slightly different training set was used; nevertheless, the size of their training set is comparable to the set used here). The test set was extracted from a portion of Switchboard distinct from $\mathcal{Y}_{\textrm{train}}$ . Similarly, we extracted an 11k-token development set.
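The jump from about 10k tokens to about 100k pairs follows from counting: a word type with $k$ tokens contributes $k(k-1)/2$ same-word pairs. The sketch below assumes that every two tokens sharing a label form a pair. The exact pair-selection procedure is not spelled out here, so this is an illustration only, and `same_word_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations

def same_word_pairs(labels):
    """List all index pairs (m, n), m < n, whose tokens share a word label."""
    by_type = defaultdict(list)
    for index, label in enumerate(labels):
        by_type[label].append(index)
    pairs = []
    for indices in by_type.values():
        # A type with k tokens yields k * (k - 1) / 2 pairs.
        pairs.extend(combinations(indices, 2))
    return pairs
```

Only the pairing is used for training; the labels themselves are not needed once the pairs exist, which is why the same pairs could also come from an unsupervised term discovery system.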

As mentioned in Section 1, recent studies have also been using frame-level embedding approaches in combination with DTW to perform the same-different task. These approaches map the original features to a new frame-level representation that is tailored to word discrimination. We compare our results to those of , which uses posteriorgrams over a partitioned universal background model (UBM), as well as , which uses a correspondence autoencoder.

3.2 Network architectures

We used the Theano toolkit to implement the CNN-based models of Sections 2.2 and 2.3 (CNN code: https://github.com/kamperh/couscous; complete recipe: https://github.com/kamperh/recipe_swbd_wordembeds). Models are trained using ADADELTA , an adaptive learning rate stochastic optimization method that adapts over time based on an accumulation of past gradients; we set the momentum hyper-parameter to $\rho=0.9$ and the precision parameter to $\epsilon=10^{-6}$ . Input speech segments are padded to $n_{\text{pad}}=200\text{ frames}$ (2 s), which corresponds to the longest word segment in $\mathcal{Y}_{\text{train}}$ . The architectures of the CNNs were optimized separately on the development data for each network type, resulting in the following structures:
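For readers unfamiliar with ADADELTA, one parameter update can be sketched in NumPy as below, with $\rho$ and $\epsilon$ set to the values given above. This follows the standard published update rule and is not taken from the released Theano code. The `state` dictionary and the function name are hypothetical.

```python
import numpy as np

def adadelta_step(param, grad, state, rho=0.9, eps=1e-6):
    """Apply one ADADELTA update and return the new parameter value."""
    # Running average of squared gradients.
    state["g2"] = rho * state["g2"] + (1.0 - rho) * grad ** 2
    # Step scaled by the ratio of the RMS of past updates to the RMS of gradients.
    delta = -np.sqrt(state["dx2"] + eps) / np.sqrt(state["g2"] + eps) * grad
    # Running average of squared updates.
    state["dx2"] = rho * state["dx2"] + (1.0 - rho) * delta ** 2
    return param + delta

# Usage on f(x) = x^2, whose gradient is 2x.
x = 3.0
state = {"g2": 0.0, "dx2": 0.0}
first = adadelta_step(x, 2.0 * x, dict(state))
for _ in range(100):
    x = adadelta_step(x, 2.0 * x, state)
```

No global learning rate appears in the update, which is why only $\rho$ and $\epsilon$ need to be set.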

Word classifier CNN: 1-D convolution with 96 filters over 9 frames; ReLU; max pooling over 3 units; 1-D convolution with 96 filters over 8 units; ReLU; max pooling over 3 units; 1024-unit fully-connected ReLU; softmax layer over 1061 word types.

Word similarity Siamese CNN: two convolutional and max pooling layers as above; 2048-unit fully-connected ReLU; 1024-unit fully-connected linear layer; terminates in loss $l(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})$ .
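The layer sizes listed above imply the following sequence lengths for a 200-frame padded input. This holds only under assumptions that are not stated in the text: convolutions run over time only, with stride 1 and no padding ("valid" mode), and max pooling is non-overlapping with any remainder dropped. The helper names are hypothetical.

```python
def conv_output_length(n_in, filter_width):
    """Length after a 'valid' 1-D convolution with stride 1 (assumed)."""
    return n_in - filter_width + 1

def pool_output_length(n_in, pool_width):
    """Length after non-overlapping max pooling, remainder dropped (assumed)."""
    return n_in // pool_width

def layer_lengths(n_pad=200):
    """Sequence length after each convolution and pooling layer."""
    lengths = []
    n = conv_output_length(n_pad, 9)   # 96 filters over 9 frames
    lengths.append(n)
    n = pool_output_length(n, 3)       # max pooling over 3 units
    lengths.append(n)
    n = conv_output_length(n, 8)       # 96 filters over 8 units
    lengths.append(n)
    n = pool_output_length(n, 3)       # max pooling over 3 units
    lengths.append(n)
    return lengths

lengths = layer_lengths()
flattened_size = lengths[-1] * 96      # inputs to the first fully-connected layer
```

Under these assumptions the lengths are 192, 64, 57 and 19, so the first fully-connected layer would receive 19 × 96 = 1824 values; a different border mode or pooling convention would change these numbers.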

For the word classifier CNN, we only train on words in $\mathcal{Y}_{\text{train}}$ that occur at least three times; this gives a subset of 87% of all tokens with 1061 unique word types. This minimum count was tuned on the development set. To see the effect of the convolutions, we also train a word classifier deep neural network (DNN) using two 2048-unit fully-connected ReLU layers and a 1061-unit softmax layer. For the Siamese CNN using $l_{\text{cos hinge}}$ , we use a margin $m=0.15$ (tuned on the development set). If we had used ReLUs in the final layer in the Siamese CNNs, the angles between embeddings would be restricted to $[0,\pi/2]$ ; we therefore use a final linear layer. All weights are initialized randomly; we run all models with five different initializations and report average performance and standard deviations.
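The restriction imposed by a final ReLU layer can be checked numerically. The sketch below uses random vectors as stand-ins for final-layer outputs (an assumption made purely for illustration) and compares pairwise cosine similarities with and without a ReLU.

```python
import numpy as np

def pairwise_cosines(X):
    """Cosine similarity for every unordered pair of rows in X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    i, j = np.triu_indices(len(X), k=1)
    return np.sum(X[i] * X[j], axis=1)

rng = np.random.default_rng(0)
linear_out = rng.standard_normal((200, 32))   # stand-in for linear-layer outputs
relu_out = np.maximum(linear_out, 0.0)        # the same outputs passed through a ReLU

min_linear = pairwise_cosines(linear_out).min()
min_relu = pairwise_cosines(relu_out).min()
```

Because ReLU outputs have no negative components, their dot products cannot be negative, so `min_relu` is never below zero and all angles stay within $[0,\pi/2]$. The linear outputs have no such constraint.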

3.3 Results

Table 1 shows AP performance on the test set from previous studies (models 1 to 4) as well as our newly implemented models (5 to 11).

The first three models perform word discrimination using DTW on frame-level embeddings of word segments; model 1 works directly on acoustic features, while models 2 and 3 work on features optimized for word discrimination. Model 3 yields the best previously reported result on this task. Model 4 is the best acoustic word embedding approach from (Section 2.1), representing the best previous result for an approach that produces embeddings of whole word segments.

Models 5 to 11 are the neural network-based approaches. The effect of using the convolutional layers is evident from the large improvement in AP of model 6 over model 5. Both of these models are trained on the word type labels $\mathcal{W}_{\text{train}}$ , which is also the type of supervision used for model 4, making the improvement of model 6 over model 4 noteworthy. (For model 6, embeddings are taken from the final softmax output. We also experimented with embeddings from the softmax layer but before applying the exponential normalization; this gave worse development results.) The dimensionality of the acoustic embeddings of model 6, however, is much larger than that of model 4. We therefore also trained a version of model 6 where the embedding is obtained from a linear bottleneck layer inserted just before the final softmax layer. The lower-dimensional embeddings from this approach (model 7) still improve on model 4 by a sizable margin.

Of the Siamese CNNs, the model with the $l_{\text{cos hinge}}$ loss (model 9, Section 2.3) outperforms its $l_{\text{cos cos}^{2}}$ counterpart, and yields a large improvement over model 4, which was the previous best acoustic embedding approach. It also gives similar performance to the word classification CNN (model 6), even though the pair-wise side information $\mathcal{S}_{\textrm{train}}$ used for model 9 is a weaker form of supervision than the fully labelled supervision $\mathcal{W}_{\text{train}}$ used in model 6. When reducing the embedding dimensionality to 50 (model 10), AP is still higher than any of the $d=50$ competitors. Model 9’s improvement over model 3 is also interesting since the former does not use any DTW alignment information. Finally, model 11 shows that LDA on the output of model 9 does not yield any improvement, but does produce a much smaller embedding without loss in performance. This model uses exactly the same word class supervision $\mathcal{W}_{\text{train}}$ as models 6 and 7.

3.4 Further discussion and analysis

Although the structures of models 8 and 9 are identical, the model using $l_{\text{cos hinge}}$ significantly outperforms its counterpart using $l_{\text{cos cos}^{2}}$ . This is in line with the fact that a loss like $l_{\text{cos hinge}}$ , which optimizes embeddings based on relative distances between positive and negative pairs, is much more closely aligned with the discrimination task than a loss like $l_{\text{cos cos}^{2}}$ , which looks at distances of word pairs in isolation (without regard to their distances relative to other pairs). The $l_{\text{cos hinge}}$ loss also allows more freedom in the model since it does not penalize same-word pairs $(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{2})$ if they are already more similar by the margin $m$ than the corresponding different-word pairs $(\boldsymbol{\mathbf{x}}_{1},\boldsymbol{\mathbf{x}}_{3})$ .

The closer match between the same-different task and the training loss $l_{\text{cos hinge}}$ could also explain the improvements over the DTW-based model 3 (both using exactly the same supervision $\mathcal{S}_{\text{train}}$ ); this latter model aims to learn better features at the local frame level, but does so without regard to the (relative) similarities of complete segments.

Figure 2 shows AP on the development set when varying the target dimensionalities of the different CNN-based approaches (see Section 2.4). The $l_{\text{cos hinge}}$ Siamese CNN outperforms all the other models (apart from the post-processed LDA model) at all operating points, and gives stable performance over a range of dimensionalities ( $300$ and onwards). The word classifier CNN does much worse in this case (compared to the result of model 6), perhaps since embeddings here are not taken from the final layer, which is explicitly optimized for word classification, but from an intermediate layer. In contrast, for the Siamese CNNs, embeddings are always obtained directly from the layer that is optimized in the target loss. The figure also shows that when word labels $\mathcal{W}_{\text{train}}$ are available, compact embeddings can be obtained by performing LDA on top of the Siamese CNN representation, without loss in performance; this can prove to be important for downstream tasks which might require smaller embeddings.

Here, a relatively small set of labelled word examples $\mathcal{Y}_{\text{train}}$ is used to train the word classifier networks (as also done in the studies we compare to). In contrast, by using pairs of words and relative comparisons between them, a much larger set $\mathcal{S}_{\text{train}}$ is used for training Siamese networks. This type of paired supervision is ideal for generalizing to unseen word types, and is often easier to obtain in low-resource settings (see Section 2.3). While frame-level feature learning (model 3) can also use the larger pair-wise training set $\mathcal{S}_{\text{train}}$ , such approaches need to be coupled with DTW, which is limiting.

Conclusion

We studied several acoustic word embedding approaches based on convolutional neural networks (CNNs); these networks take a whole-word speech segment as input and produce a fixed-dimensional vector. Our best new approach is a Siamese CNN that uses a hinge-based loss function to minimize the distance between word pairs of the same type relative to the distance between pairs of different types. On the same-different word discrimination task, this approach yields an average precision (AP) of 0.549, an improvement over the best previously published results on this task with whole-word embeddings (0.365 AP) and DTW with learned frame features (0.469 AP). A word classifier CNN performs similarly (0.532 AP) to the Siamese CNN, but requires much stronger labelled supervision, and performs worse at smaller dimensionalities. Future work will consider sequence models (e.g. RNNs, LSTMs), and will apply these embeddings to downstream tasks such as term discovery, speech recognition, and search.