A segmental framework for fully-unsupervised large-vocabulary speech recognition

Herman Kamper, Aren Jansen, Sharon Goldwater

Introduction

Despite major advances in supervised speech recognition over the last few years, current methods still rely on huge amounts of transcribed speech audio, pronunciation dictionaries, and texts for language modelling. Collecting these resources poses a major obstacle for speech technology in under-resourced languages. In some extreme cases, unlabelled speech data might be the only available resource. In this zero-resource scenario, unsupervised methods are required to learn representations and linguistic structure directly from the speech signal. Such methods can, for instance, make it possible to search through a corpus of unlabelled speech using voice queries, allow topics within speech utterances to be identified without supervision, or be used to automatically cluster related spoken documents. Similar techniques are required to model how human infants acquire language from speech input, and for developing robotic applications that can learn a new language in an unknown environment.

Interest in zero-resource speech processing has grown considerably in the last few years, with two central research areas emerging. The first deals with unsupervised representation learning, where the task is to find speech features (often at the frame level) that make it easier to discriminate between meaningful linguistic units (phones or words). This task has been described as ‘phonetic discovery’, ‘unsupervised acoustic modelling’ and ‘unsupervised subword modelling’, depending on the type of feature representations that are produced. Approaches include those using bottom-up trained Gaussian mixture models (GMMs) to produce frame-level posteriorgrams, using unsupervised hidden Markov models (HMMs) to obtain discrete categorical output in terms of discovered subword units, and using unsupervised neural networks (NNs) to obtain frame-level continuous vector representations.

The second area of zero-resource research deals with unsupervised segmentation and clustering of speech into meaningful units. This is important in tasks such as query-by-example search, where a system needs to find all the utterances in a corpus containing a spoken query, or in unsupervised term discovery (UTD), where a system needs to automatically find repeated word- or phrase-like patterns in a speech collection. UTD systems typically find and cluster only isolated acoustic segments, leaving the rest of the data as background. We are interested in full-coverage segmentation and clustering, where word boundaries and lexical categories are predicted for the entire input. Several recent studies share this goal. Successful full-coverage segmentation systems would perform a type of unsupervised speech recognition. This would allow downstream applications, such as query-by-example search and speech indexing (grouping together related utterances in a corpus), to be developed in a manner similar to when supervised systems are available. Unsupervised segmentation and clustering, however, is a daunting task, and current performance lags behind that of even minimally-supervised systems. Nevertheless, previous work has shown that high-error-rate unsupervised systems can still be used effectively for a wide range of tasks including topic identification and clustering of spoken documents, speech-to-speech translation of low-resource languages, language recognition, and in improving purely supervised keyword search systems.

In previous work, we introduced a novel unsupervised segmental Bayesian model for full-coverage segmentation and clustering of small-vocabulary speech. Other approaches mostly perform frame-by-frame modelling using subword discovery with subsequent or joint word discovery. In contrast, our approach models whole-word units directly using a fixed-dimensional embedding representation; any potential word segment (of arbitrary length) is mapped to a fixed-length vector, its acoustic word embedding, and the model builds a whole-word acoustic model in the embedding space while jointly performing segmentation. In that work, we evaluated the model in an unsupervised digit recognition task using the TIDigits corpus. Although it was able to accurately segment and cluster the small number of word types (lexical items) in the data, the same system could not be applied directly to multi-speaker data with larger vocabularies. This was due to the large number of embeddings that had to be computed and the inefficiency of the embedding method itself.

In this paper, we present a new system that uses the same overall framework as our previous small-vocabulary system, but with several changes designed to improve efficiency and speaker independence, allowing us to scale up to large-vocabulary multi-speaker data. We believe this is the first full-coverage unsupervised speech recognition system to be applied in this regime; previous systems have either focused on identifying isolated terms, were speaker-dependent, or used only a small vocabulary. Given this is the first attempt we are aware of, the results reported here will serve as a useful baseline for future work on unsupervised speech recognition of multi-speaker data with realistic vocabularies.

For our efficiency improvements, we use a bottom-up unsupervised syllable boundary detection method to eliminate unlikely word boundaries, reducing the number of potential word segments that need to be considered. We also use a computationally much simpler embedding approach based on downsampling.

For better speaker-independent performance, we incorporate a frame-level representation learning method introduced in our previous work: the correspondence autoencoder (cAE). The cAE uses noisy word pairs identified by an unsupervised term discovery system to provide weak supervision for training a deep NN on aligned frame pairs; features are then extracted from one of the network layers. In that work, we showed that cAE frame-level features outperform traditional features (MFCCs) and GMM-based representations in a multi-speaker intrinsic evaluation. Here, we show that the cAE features also improve performance of our full-coverage multi-speaker segmentation and clustering system (relative to MFCC features). These results are the first to show that unsupervised representation learning can improve a full-coverage zero-resource system.

We evaluate our approach in both speaker-dependent and speaker-independent settings on conversational speech datasets from two languages: English and Xitsonga. Xitsonga is an under-resourced southern African Bantu language. These datasets were also used as part of the Zero Resource Speech Challenge (ZRS) at Interspeech 2015, and we show that our system outperforms competing systems on several of the ZRS metrics. These metrics measure aspects ranging from cluster quality to segmentation performance. In particular, we find that by proposing a consistent segmentation and clustering over a whole utterance, our approach makes better use of the bottom-up syllabic constraints than the purely bottom-up syllable-based system of Räsänen et al. Moreover, we achieve similar $F$-scores for word tokens, types, and boundaries whether training in a speaker-dependent or speaker-independent mode.

By mapping the unsupervised output to ground truth transcriptions, we also evaluate word error rate (WER), a metric not included in the ZRS Challenge. Our best system has WERs of around 70–80% for speaker-dependent and 80–95% for speaker-independent settings. Although these are high error rates, our results and analysis should provide useful baselines and guidance for future work in this area. Code for this work is available at https://github.com/kamperh/bucktsong_segmentalist.

Related work

Below we first discuss related work on unsupervised representation learning, followed by unsupervised term discovery (which we also compare our approach to), and, finally, full-coverage segmentation and clustering of unlabelled speech.

Unsupervised representation learning, in this context, involves finding a frame-level mapping from input features to a new representation that makes it easier to discriminate between different linguistic units (normally subwords or words).

Early studies used bottom-up approaches operating directly on the acoustics. Zhang and Glass successfully used posteriorgram features from an unsupervised GMM universal background model (UBM) for query-by-example search and term discovery. Similarly, Chen et al. used posteriorgrams from a non-parametric infinite GMM. Approaches using unsupervised HMMs to perform a bottom-up tokenization of speech include the successive state-splitting algorithm of Varadarajan et al., the more traditional iterative re-estimation and unsupervised decoding procedure of Siu et al., and the non-parametric Bayesian HMM of Lee and Glass. More recently, NNs have been used for bottom-up representation learning: stacked autoencoders (AEs), a type of unsupervised deep NN that tries to reconstruct its input, have been used in several studies.

The above approaches perform representation learning without regard to longer-spanning word- or phrase-like patterns in the data. In several recent studies, unsupervised term discovery (UTD) is used to automatically discover such patterns; these then serve as weak top-down constraints for subsequent representation learning. Jansen et al. showed that such constraints can be used to train HMMs and GMM-UBMs that significantly outperform their pure bottom-up counterparts. In our own work, we proposed the correspondence autoencoder (cAE): an AE-like deep NN that incorporates top-down constraints by using aligned frames from discovered words as input-output pairs. The model significantly outperformed the top-down GMM-UBM and stacked AEs in an intrinsic evaluation: isolated word discrimination. Since then, several researchers have used such weak top-down supervision in training unsupervised NN-based models. In this paper we show that cAE-learned features also improve performance of our multi-speaker unsupervised segmentation and clustering system.

Unsupervised term discovery

Unsupervised term discovery (UTD) is the task of finding meaningful word- or phrase-like patterns in unlabelled speech data. Most state-of-the-art UTD systems use a variant of dynamic time warping (DTW), called segmental DTW. This algorithm, developed by Park and Glass, identifies similar sub-sequences within two vector time series, rather than comparing entire sequences as in standard DTW. In most UTD systems, segmental DTW proposes pairs of matching segments which are then clustered using a graph-based method. Follow-up work has built on Park and Glass’ original method in various ways, for example through improved feature representations or by greatly improving its efficiency.

The baseline provided as part of the lexical discovery track of the Zero Resource Speech Challenge 2015 (ZRS) is a UTD system based on the earlier work of . The other UTD submission to the ZRS by Lyzinski et al. extended the baseline system using improved graph clustering algorithms. In our evaluation, we compare to both these systems. Our approach shares the property of UTD systems that it has no subword level of representation and operates directly on whole-word representations. However, instead of representing each segment as a vector time series with variable duration as in UTD, we map each potential word segment to a fixed-dimensional acoustic word embedding; we can then define an acoustic model in the embedding space and use it to compare segments without performing DTW alignment. Our system also performs full-coverage segmentation and clustering, in contrast to UTD, which segments and clusters only isolated acoustic patterns.

Full-coverage segmentation and clustering of speech

Early work considered full-coverage word segmentation of transcribed phonemic or phonetic symbol sequences. This laid the foundation for subsequent efforts to develop methods to entirely segment raw speech into word-like clusters. The approach at the 2012 JHU CLSP workshop used symbolic word segmentation methods on top of automatically discovered subword units, but this pipelined approach gave very poor performance. More recent efforts attempt to segment raw speech directly; approaches include using non-negative matrix factorization, using iterative decoding and refinement for jointly training subword HMMs and a lexicon, and using discrete HMMs to model whole words in terms of discovered subword units. Below we highlight two studies which have inspired our work in particular.

Lee et al. developed a non-parametric hierarchical Bayesian model for full-coverage speech segmentation. Their model consists of a bottom subword acoustic modelling layer, a noisy channel model for capturing pronunciation variability, a syllable layer, and a highest-level word layer. When applied to speech from single speakers in the MIT Lecture corpus, most words with high TF-IDF scores were successfully discovered. As in their model, we also follow a Bayesian approach, which is useful for incorporating prior knowledge and for finding sparser solutions. However, where Lee et al. only considered single-speaker data, we additionally evaluate on large-vocabulary multi-speaker data.

Furthermore, in contrast to Lee et al., our model operates directly at the whole-word level instead of having both word and subword models. By taking this different perspective, our segmental whole-word approach is a complementary contribution to the field of zero-resource speech processing. The approach is further motivated by the observation that it is often easier to identify cross-speaker similarities between words than between subwords, which is why most UTD systems focus on longer-spanning patterns. There is also evidence that infants are able to segment whole words from continuous speech while still learning phonetic contrasts in their native language. A benefit of the segmental embedding approach we use is that segments can be compared directly in a fixed-dimensional embedding space, meaning that word discovery can be performed using standard clustering methods (in our case using a Bayesian GMM acoustic model). Finally, segmental approaches do not make the frame-level independence assumptions of most of the models above; such assumptions have long been argued against.

The second study we draw from is the ZRS submission of Räsänen et al., which we use to help scale our approach to larger vocabularies. Their full-coverage word segmentation system relies on an unsupervised method that predicts boundaries for syllable-like units, and then clusters these units on a per-speaker basis. Using a bottom-up greedy mapping, reoccurring syllable clusters are then predicted as words. From here onward we use syllable to refer to the syllable-like units detected in the first step of their approach.

In our model, we incorporate the syllable boundary detection method of Räsänen et al. (the first component of their system) as a presegmentation method to eliminate unlikely word boundaries. Both human infants and adults use syllabic cues for word segmentation, and using such a bottom-up unsupervised syllabifier can therefore be seen as one way to incorporate prior knowledge of the speech signal into a zero-resource system.

Large-vocabulary segmental Bayesian model

In the following we describe our large-vocabulary system in detail, starting with a high-level overview of the model, illustrated in Figure 1.

Initially, however, we do not know where words start and end in the stream of features. But if we have a GMM acoustic model, we can use this model to segment an utterance by choosing word boundaries that yield segments (acoustic word embeddings) that have high probability under the acoustic model. Our full system therefore initializes word boundaries at random, extracts word embeddings, clusters them using the Bayesian GMM, and then iteratively re-analyzes each utterance (jointly re-segmenting it and re-clustering the segments) based on the current acoustic model. The result is a complete segmentation of the input speech and a prediction of the component to which every word segment belongs. The model is implemented as a single blocked Gibbs sampler, and exact details are given next.

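Given the embedded word vectors $\mathcal{X}=\{\boldsymbol{\mathbf{x}}_{i}\}_{i=1}^{N}$ from the current segmentation hypothesis, the acoustic model needs to assign each acoustic word embedding $\boldsymbol{\mathbf{x}}_{i}$ to one of $K$ clusters, with each cluster corresponding to a hypothesized word type. We use a Bayesian GMM as acoustic model, with a conjugate Dirichlet prior over its mixture weights $\boldsymbol{\mathbf{\pi}}$ and a conjugate diagonal-covariance Gaussian prior over its component means $\left\{\boldsymbol{\mathbf{\mu}}_{k}\right\}_{k=1}^{K}$, which allows us to integrate out these parameters. The model, illustrated in Figure 2, is formally defined as:

$$
\begin{aligned}
\boldsymbol{\mathbf{\pi}} &\sim \textrm{Dir}\left((a/K)\,\boldsymbol{\mathbf{1}}\right) && (1)\\
z_{i} &\sim \boldsymbol{\mathbf{\pi}} && (2)\\
\boldsymbol{\mathbf{\mu}}_{k} &\sim \mathcal{N}(\boldsymbol{\mathbf{\mu}}_{0},\sigma_{0}^{2}\boldsymbol{\mathbf{I}}) && (3)\\
\boldsymbol{\mathbf{x}}_{i} &\sim \mathcal{N}(\boldsymbol{\mathbf{\mu}}_{z_{i}},\sigma^{2}\boldsymbol{\mathbf{I}}) && (4)
\end{aligned}
$$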

Latent variable $z_{i}$ indicates the component to which $\boldsymbol{\mathbf{x}}_{i}$ is assigned. All $K$ components share the same fixed covariance matrix $\sigma^{2}\boldsymbol{\mathbf{I}}$. The hyperparameters of the mixture components are denoted together as $\boldsymbol{\mathbf{\beta}}=(\boldsymbol{\mathbf{\mu}}_{0},\sigma_{0}^{2},\sigma^{2})$. These hyperparameters could potentially be learned themselves, but here we set them by hand based on previous studies, as described in Section 4.3.

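Given $\mathcal{X}$, we infer the component assignments $\boldsymbol{\mathbf{z}}=(z_{1},z_{2},\ldots,z_{N})$ using a collapsed Gibbs sampler. This is done in turn for each $z_{i}$ conditioned on all the other current component assignments:

$$P(z_{i}=k|\boldsymbol{\mathbf{z}}_{\backslash i},\mathcal{X};a,\boldsymbol{\mathbf{\beta}})\propto P(z_{i}=k|\boldsymbol{\mathbf{z}}_{\backslash i};a)\,p(\boldsymbol{\mathbf{x}}_{i}|\mathcal{X}_{k\backslash i};\boldsymbol{\mathbf{\beta}}) \qquad (5)$$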

where $\boldsymbol{\mathbf{z}}_{\backslash i}$ is all latent component assignments excluding $z_{i}$ and $\mathcal{X}_{k\backslash i}$ is the set of embedding vectors assigned to component $k$ apart from $\boldsymbol{\mathbf{x}}_{i}$. The first term in (5) can be calculated as:

$$P(z_{i}=k|\boldsymbol{\mathbf{z}}_{\backslash i};a)=\frac{N_{k\backslash i}+a/K}{N+a-1} \qquad (6)$$

where $N_{k\backslash i}$ is the number of embedding vectors from mixture component $k$ without taking $\boldsymbol{\mathbf{x}}_{i}$ into account [52, p. 843]. This term can be interpreted as a discounted unigram language modelling probability. The term $p(\boldsymbol{\mathbf{x}}_{i}|\mathcal{X}_{k\backslash i};\boldsymbol{\mathbf{\beta}})$ in (5) is the posterior predictive of $\boldsymbol{\mathbf{x}}_{i}$, which (because of the conjugate prior) is a spherical covariance Gaussian distribution with analytic expressions for its mean and covariance parameters; these expressions are given in Appendix A. Intuitively, component assignment sampling in (5) is therefore based on a combination of language model and acoustic scores.
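
As an illustration of how the language-model-like term and the acoustic term combine, the following is a minimal sketch of one collapsed Gibbs sweep for a fixed spherical-covariance GMM. It is not the authors' implementation: the function names are hypothetical, the count-based prior assumes a symmetric Dirichlet with concentration $a$, and the posterior predictive uses the standard conjugate result for a Gaussian mean with known variance rather than the exact expressions of Appendix A.

```python
import math
import random


def log_post_predictive(x, members, mu0, var0, var):
    """Log p(x | members) for a spherical Gaussian with known variance `var`
    and a N(mu0, var0 * I) prior on the mean (standard conjugate result)."""
    n, dim = len(members), len(x)
    # Posterior variance of the mean, then variance of the predictive.
    var_n = 1.0 / (1.0 / var0 + n / var)
    var_pred = var + var_n
    sq_dist = 0.0
    for d in range(dim):
        total = sum(m[d] for m in members)
        mean_n = var_n * (mu0[d] / var0 + total / var)
        sq_dist += (x[d] - mean_n) ** 2
    return -0.5 * dim * math.log(2 * math.pi * var_pred) - 0.5 * sq_dist / var_pred


def gibbs_sweep(X, z, K, a, mu0, var0, var, rng):
    """Resample every assignment z[i] in place, one at a time."""
    N = len(X)
    for i in range(N):
        log_w = []
        for k in range(K):
            # Members of component k, excluding the point being resampled.
            members = [X[j] for j in range(N) if j != i and z[j] == k]
            log_prior = math.log((len(members) + a / K) / (N + a - 1))
            log_w.append(log_prior + log_post_predictive(X[i], members, mu0, var0, var))
        top = max(log_w)
        weights = [math.exp(w - top) for w in log_w]
        z[i] = rng.choices(range(K), weights=weights)[0]
    return z
```

The sketch recomputes component statistics from scratch for clarity; a practical sampler would cache per-component counts and sums and update them incrementally.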

Above we described clustering given the current segmentation. But segmentation and clustering are performed jointly: for the utterance under consideration, a segmentation is sampled using the current acoustic model (marginalizing over cluster assignments for each potential segment), and clusters are then resampled for the newly created segments. Pseudo-code for the blocked Gibbs sampler that implements this algorithm is given in Algorithm 1. The acoustic data is denoted as $\{\boldsymbol{\mathbf{s}}_{i}\}_{i=1}^{S}$, where every utterance $\boldsymbol{\mathbf{s}}_{i}$ consists of acoustic frames $\boldsymbol{\mathbf{y}}_{1:M_{i}}$, and $\mathcal{X}(\boldsymbol{\mathbf{s}}_{i})$ denotes the embedding vectors under the current segmentation for utterance $\boldsymbol{\mathbf{s}}_{i}$. In Algorithm 1, utterance $\boldsymbol{\mathbf{s}}_{i}$ is selected according to a random permutation of all utterances; the embeddings from the current segmentation $\mathcal{X}(\boldsymbol{\mathbf{s}}_{i})$ are removed from the Bayesian GMM; a new segmentation is sampled; and finally the embeddings from this new segmentation are added back into the Bayesian GMM. Line 5 uses the forward filtering backward sampling dynamic programming algorithm to sample the new embeddings; details of this step are given in Appendix B.
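
The segmentation step can be pictured with a generic forward filtering backward sampling routine. The sketch below is an assumption-laden illustration rather than the procedure of Appendix B: `log_seg_score` is a hypothetical callback standing in for the marginal acoustic score of a candidate segment, positions are indexed by candidate boundary (for instance syllable boundaries), any duration weighting is omitted, and at least one segmentation is assumed to have a finite score.

```python
import math
import random


def logsumexp(values):
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def sample_segmentation(n_units, log_seg_score, max_len, rng):
    """Forward filtering, backward sampling over boundary positions 0..n_units.

    log_seg_score(start, end) is a hypothetical callback returning the log
    score of the segment covering units start..end-1.
    """
    # Forward pass: log_alpha[t] sums over all segmentations of units 0..t-1.
    log_alpha = [0.0] + [-math.inf] * n_units
    for t in range(1, n_units + 1):
        log_alpha[t] = logsumexp(
            [log_alpha[t - j] + log_seg_score(t - j, t)
             for j in range(1, min(t, max_len) + 1)]
        )
    # Backward pass: sample the length of the last segment, then move left.
    segments, t = [], n_units
    while t > 0:
        lengths = list(range(1, min(t, max_len) + 1))
        log_w = [log_alpha[t - j] + log_seg_score(t - j, t) for j in lengths]
        top = max(log_w)
        j = rng.choices(lengths, weights=[math.exp(w - top) for w in log_w])[0]
        segments.append((t - j, t))
        t -= j
    return segments[::-1]
```

In a blocked sampler of the kind described above, a call like this would sit between removing an utterance's embeddings from the acoustic model and adding the newly sampled ones back.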

Unsupervised syllable boundary detection

Without any constraints, the input at the bottom of Figure 1 could be segmented into any number of possible words using a huge number of possible segmentations. In our previous work, potential word segments were therefore required to be between 200 ms and 1 s in duration, and word boundaries were only considered at 20 ms intervals. This still results in a very large number of possible segments. Here we instead use a syllable boundary detection method to eliminate unlikely word boundaries, with word candidates spanning a maximum of six syllables. On the waveform in Figure 1, solid and dashed lines are used to indicate the only positions where boundaries are considered during sampling, as determined by the syllabification method.
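
To make the effect of the syllable constraint concrete, the sketch below enumerates the word candidates that remain once boundaries are restricted to detected syllable boundaries and candidates span at most six syllables. The function name and the example boundary times are illustrative assumptions.

```python
def candidate_segments(boundaries, max_syllables=6):
    """Enumerate (start, end) word candidates spanning 1..max_syllables
    consecutive syllables, given sorted syllable boundary times."""
    segments = []
    for i in range(len(boundaries) - 1):
        last = min(i + max_syllables, len(boundaries) - 1)
        for j in range(i + 1, last + 1):
            segments.append((boundaries[i], boundaries[j]))
    return segments


# Three syllables delimited by four boundaries (times in seconds).
print(candidate_segments([0.0, 0.2, 0.5, 0.9]))
```

For an utterance with $n \geq 6$ syllables this yields $\sum_{k=1}^{6}(n-k+1)$ candidates, so the number of embeddings to compute grows linearly with utterance length.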

Räsänen et al. evaluated several syllable boundary detection algorithms, and we use the best of these. First the envelope of the raw waveform is calculated by downsampling the rectified signal and applying a low-pass filter. Inspired by neuropsychological studies which found that neural oscillations in the auditory cortex occur at frequencies similar to that of the syllabic rhythm in speech, the calculated envelope is used to drive a discrete time oscillation system with a centre frequency of typical syllabic rhythm. This discrete time system is used to mathematically model the damped harmonic oscillations in the auditory system, which are hypothesized to match syllabic rhythm. Minima in the oscillator’s amplitude give the predicted syllable boundaries. In this work, we use the syllabification code kindly provided by Räsänen et al. without any modification and with the default parameter settings.

Acoustic word embeddings and unsupervised representation learning

A simple and fast approach to obtain acoustic word embeddings is to uniformly downsample so that any segment is represented by the same fixed number of vectors. A similar approach is to divide a segment into a fixed number of intervals and average the frames in each interval. The downsampled or averaged frames are then flattened to obtain a single fixed-length vector. Although these very simple approaches are less accurate at word discrimination than the approach used in our previous work, they have been used effectively in several studies and are computationally much more efficient. Here we use downsampling as our acoustic word embedding function $f_{e}$ in Figure 1; we keep ten equally-spaced vectors from a segment, and use a Fourier-based method for smoothing to deal with cases where segments are not exactly divisible.
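
A minimal sketch of a downsampling embedding function is given below. It departs from the description above in one labelled way: linear interpolation between neighbouring frames is swapped in for the Fourier-based smoothing, purely to keep the example dependency-free. The function name is hypothetical, and the final unit-length normalization follows the hyperparameter description later in the paper.

```python
import math


def downsample_embedding(frames, n=10):
    """Map a variable-length list of D-dim frames to a unit-length n*D vector.

    Assumption: linear interpolation replaces Fourier-based resampling.
    """
    T = len(frames)
    embedding = []
    for i in range(n):
        # Fractional frame position of the i-th equally spaced sample.
        pos = i * (T - 1) / (n - 1) if T > 1 else 0.0
        lo = int(math.floor(pos))
        hi = min(lo + 1, T - 1)
        frac = pos - lo
        embedding.extend(
            (1 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])
        )
    norm = math.sqrt(sum(v * v for v in embedding))
    return [v / norm for v in embedding] if norm > 0 else embedding
```

With 13-dimensional frames and `n=10`, every segment maps to a 130-dimensional vector regardless of its duration.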

Figure 1 shows that $f_{e}$ takes as input a sequence of frame-level features from the feature extracting function $f_{a}$. One option for $f_{a}$ is to simply use MFCCs. As an alternative, we incorporate unsupervised representation learning (Section 2.1) into our approach by using the cAE as a feature extractor. Complete details of the cAE are given in our previous work, but we briefly outline the training procedure here. A UTD system is used to discover word pairs which serve as weak top-down supervision. The cAE operates at the frame level, so the word-level constraints are converted to frame-level constraints by aligning each word pair using DTW. Taken together across all discovered pairs, this results in a set of $F$ frame-level pairs $\left\{\left(\boldsymbol{\mathbf{y}}_{i,a},\boldsymbol{\mathbf{y}}_{i,b}\right)\right\}_{i=1}^{F}$. Here, each frame is a single MFCC vector. For every pair $\left(\boldsymbol{\mathbf{y}}_{a},\boldsymbol{\mathbf{y}}_{b}\right)$, $\boldsymbol{\mathbf{y}}_{a}$ is presented as input to the cAE while $\boldsymbol{\mathbf{y}}_{b}$ is taken as output, and vice versa. The cAE consists of several non-linear layers which are initialized by pretraining the network as a standard autoencoder. The cAE is then tasked with reconstructing $\boldsymbol{\mathbf{y}}_{b}$ from $\boldsymbol{\mathbf{y}}_{a}$, using the loss $\left|\left|\boldsymbol{\mathbf{y}}_{b}-\hat{\boldsymbol{\mathbf{y}}}_{b}\right|\right|^{2}$, where $\hat{\boldsymbol{\mathbf{y}}}_{b}$ is the network output for input $\boldsymbol{\mathbf{y}}_{a}$. To use the trained network as a feature extractor $f_{a}$, the activations in one of its middle layers are taken as the new feature representation.
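
The conversion from word-level to frame-level constraints can be sketched with a plain DTW alignment. This is an illustrative sketch only: the function name is hypothetical, Euclidean frame distance is an assumption (the distance used in the original pipeline is not stated here), and the network training itself is not shown.

```python
def dtw_frame_pairs(seq_a, seq_b):
    """Align two frame sequences with DTW and return aligned frame pairs."""
    def dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(seq_a[i - 1], seq_b[j - 1]) + step
    # Trace back the cheapest warping path from the end to the start.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    path.reverse()
    pairs = [(seq_a[p], seq_b[q]) for p, q in path]
    # Present every pair in both directions, as the text describes.
    return pairs + [(b, a) for a, b in pairs]
```

Each returned tuple is one (input, target) training example for the correspondence network.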

Experiments

We use three datasets, summarized in Table 1. The first two are disjoint subsets extracted from the Buckeye corpus of conversational English, while the third is a portion of the Xitsonga section of the NCHLT corpus of languages spoken in South Africa. Xitsonga is a Bantu language spoken in southern Africa; although it is considered under-resourced, more than five million people use it as their first language (http://www.ethnologue.com/language/tso).

The two sets extracted from Buckeye, referred to as English1 and English2, respectively contain six and five hours of speech, each from twelve speakers (six female and six male). The Xitsonga dataset consists of 2.5 hours of speech from 24 speakers (twelve female, twelve male). English2 and the Xitsonga data were used as test sets in the ZRS challenge, so we can compare our system to others using the same data and evaluation framework. English1 was extracted for development purposes from a disjoint portion of Buckeye to match the distribution of speakers in English2. For all three sets, speech activity regions are taken from forced alignments of the data, as was done in the ZRS. From Table 1, the average duration of a word in an English set is around 250 ms, while for Xitsonga it is about 450 ms.

Our model is unsupervised, which means that the concepts of training and test data become blurred. We run our model on all sets separately—in each case, unsupervised modelling and evaluation is performed on the same set. English1 is the only set used for any development (specifically for setting hyperparameters) in any of the experiments; both English2 and Xitsonga are treated as unseen final test sets. This allows us to see how hyperparameters generalize within language on data of similar size, as well as across language on a corpus with very different characteristics.

Evaluation

The evaluation of zero-resource systems that segment and cluster speech is a research problem in itself. We use a range of metrics that have been proposed before, all performing some mapping from the discovered structures to ground truth forced alignments of the data, as illustrated in Figure 3.

Average cluster purity first aligns every discovered token to the ground truth word token with which it overlaps most. In Figure 3 the token assigned to cluster 931 would be mapped to the true word ‘yeah’, and the 477-token mapped to ‘mean’. Every discovered word type (cluster) is then mapped to the most common ground truth word type in that cluster. E.g. if most of the other tokens in cluster 931 are also labelled as ‘yeah’, then cluster 931 would be labelled as ‘yeah’. Average purity is then defined as the total proportion of correctly mapped tokens in all clusters. For this metric, more than one cluster may be mapped to a single ground truth type (i.e. many-to-one).
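
A minimal sketch of this computation follows, assuming each discovered token has already been mapped to the ground truth word it overlaps most. The function name and input format are illustrative assumptions.

```python
from collections import Counter, defaultdict


def average_cluster_purity(tokens):
    """tokens: list of (cluster_id, true_word) pairs, one per discovered token,
    where true_word is the ground truth word the token overlaps most."""
    by_cluster = defaultdict(Counter)
    for cluster_id, word in tokens:
        by_cluster[cluster_id][word] += 1
    # Many-to-one: each cluster takes its most common ground truth label.
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(tokens)


tokens = [(931, "yeah"), (931, "yeah"), (931, "yes"), (477, "mean"), (12, "yeah")]
print(average_cluster_purity(tokens))  # 4 of 5 tokens agree with their cluster label
```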

Unsupervised word error rate (WER/WER$_{\text{m}}$) uses a similar word-level mapping and then aligns the mapped decoded output from a system to the ground truth transcriptions. Based on this alignment we calculate $\textrm{WER}=\frac{S+D+I}{N}$, with $S$ the number of substitutions, $D$ deletions, $I$ insertions, and $N$ the tokens in the ground truth. The cluster mapping can be done in one of two ways: many-to-one, where more than one cluster can be assigned the same word label (as in purity), or using a greedy one-to-one mapping, where at most one cluster is mapped to a ground truth word type. The latter, which we denote simply as WER, might leave some clusters unassigned and these are counted as errors. For the former, denoted as WER$_{\text{m}}$, all clusters are labelled. Depending on the downstream speech task, it might be acceptable to have multiple clusters that correspond to the same true word; WER penalizes such clusters, while WER$_{\text{m}}$ does not. WER is a useful metric since it is easily interpretable and well-known in the speech community.
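
The formula can be illustrated with a standard Levenshtein alignment over word sequences. This sketch assumes the cluster-to-word mapping has already been applied to the system output; the function name is hypothetical, and an unassigned cluster can be represented by any placeholder label that never matches a reference word.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via Levenshtein alignment of two word lists."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: minimum edits turning reference[:i] into hypothesis[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # deletions only
    for j in range(m + 1):
        dist[0][j] = j  # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m] / n


ref = ["yeah", "i", "mean", "it"]
hyp = ["yeah", "mean", "mean", "it", "so"]
print(word_error_rate(ref, hyp))  # one substitution and one insertion over four words
```

Because insertions are counted against the reference length $N$, the value can exceed 1 when a system heavily oversegments.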

Normalized edit distance (NED) is the first of the ZRS metrics (the rest follow). These metrics use a phoneme-level mapping: each discovered token is mapped to the sequence of ground truth phonemes of which at least 50% or 30 ms are covered by the discovered segment, i.e. if the discovered segment overlaps a phoneme by at least 30 ms or 50% of the phoneme's duration, that phoneme becomes part of the phoneme sequence to which the segment is mapped. In Figure 3, the 931-token would be mapped to /y ae/ and the 477-token to /ay m iy n/. For a pair of discovered segments, the edit distance between the two phoneme strings is divided by the maximum of the length of the two strings. This is averaged over all pairs predicted to be of the same type (cluster), to obtain the final NED score. If all segments in each cluster have the same phoneme string, then $\textrm{NED}=0$, while if all phonemes are different, $\textrm{NED}=1$. NED is useful in that it does not make the assumption that the discovered segments need to correspond to true words (as in cluster purity and WER), and it only considers the patterns returned by a system (so it does not require full coverage, as WER does). As an example, if a cluster contains /m iy/ from a realization of the word ‘meaningful’ and a token /m iy n/ from the true word ‘mean’, then NED would be $1/3$ for this two-token cluster.
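
The worked example above can be checked with a short sketch. It assumes tokens have already been mapped to phoneme sequences and that at least one cluster contains two or more tokens; the function names and input format are illustrative, not those of the ZRS evaluation toolkit.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def ned(clusters):
    """clusters: dict mapping cluster id to a list of phoneme tuples.
    Averages the normalized distance over all same-cluster token pairs."""
    scores = []
    for members in clusters.values():
        for a, b in combinations(members, 2):
            scores.append(edit_distance(a, b) / max(len(a), len(b)))
    return sum(scores) / len(scores)


print(ned({931: [("m", "iy"), ("m", "iy", "n")]}))  # one edit over a maximum length of three
```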

Word boundary precision, recall, and $F$-score are calculated by comparing word boundary positions proposed by a system to those from forced alignments of the data, falling within some tolerance. A tolerance of 20 ms is mostly used, but for the ZRS the tolerance is 30 ms or 50% of a phoneme (to match the mapping). In Figure 3 the detected boundary (dashed line) would be considered correct if it is within the tolerance from the true word boundary between ‘yeah’ and ‘i’.
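
A simplified sketch of boundary scoring with a fixed tolerance is shown below. It is not the ZRS implementation: the phoneme-dependent tolerance is not modelled, matching is greedy, and the function name and the example times are assumptions.

```python
def boundary_scores(predicted, reference, tolerance=0.02):
    """Precision, recall and F-score for boundary times given in seconds.

    Each reference boundary can be matched by at most one prediction.
    """
    unmatched = sorted(reference)
    hits = 0
    for b in sorted(predicted):
        match = next((r for r in unmatched if abs(r - b) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Two of three predictions fall within 20 ms of a reference boundary.
print(boundary_scores([0.31, 0.52, 0.90], [0.30, 0.55, 0.91, 1.20]))
```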

Word token precision, recall, and $F$-score compare how accurately proposed word tokens match ground truth word tokens in the data. In contrast to the word boundary scores, both boundaries of a predicted word token need to be correct. In Figure 3, the system would receive credit for the 931-token since it is mapped to /y ae/ and therefore matches the ground truth word token ‘yeah’. However, the system would be penalized for the 477-token (mapped to /ay m iy n/) since it fails to predict word tokens corresponding to /ay/ and /m iy n/ (the ground truth words ‘i’ and ‘mean’). Both the word boundary and word token metrics give a measure of how accurately a system is segmenting its input into word-like units.

Word type precision, recall, and $F$-score compare the set of distinct phoneme mappings from the tokens returned by a system to the set of true word types in the ground truth alignments. If any discovered word token maps to a phoneme sequence that is also found as a word in the ground truth vocabulary, the system is credited for a correct discovery of that word type. For example, if the type /y ae/ (as in ‘yeah’) occurs in the ground truth alignment, the system needs to return at least one token that is mapped to /y ae/.
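
Since the type scores reduce to a comparison of two sets, they can be sketched in a few lines. The function name and input format are assumptions, and the phoneme-level mapping of tokens is taken as given.

```python
def word_type_scores(discovered_tokens, true_vocabulary):
    """discovered_tokens: phoneme tuples, one per discovered token.
    true_vocabulary: phoneme tuples of the ground truth word types."""
    discovered_types = set(discovered_tokens)
    true_types = set(true_vocabulary)
    hits = len(discovered_types & true_types)
    precision = hits / len(discovered_types) if discovered_types else 0.0
    recall = hits / len(true_types) if true_types else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


discovered = [("y", "ae"), ("y", "ae"), ("ay", "m", "iy", "n")]
vocabulary = [("y", "ae"), ("ay",), ("m", "iy", "n")]
print(word_type_scores(discovered, vocabulary))
```

In this example only /y ae/ counts as a correct type; the undersegmented /ay m iy n/ lowers precision, and the missed types /ay/ and /m iy n/ lower recall.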

We evaluate our model in both speaker-dependent and speaker-independent settings. Multiple speakers make it more difficult to discover accurate clusters: non-matching linguistic units might be more similar within-speaker than matching units across speakers. For the speaker-dependent case, the model is run and scores are computed on each speaker individually, then performance is averaged over speakers. In the speaker-independent case, the system is run and scores computed over the entire multi-speaker dataset at once. This typically results in worse purity, NED and WER$_{\text{m}}$ scores since the task is more difficult and clusters are noisier. WER is affected even more severely due to the one-to-one mapping that it uses; if there are two perfectly pure clusters that contain tokens from the same true word, but the two clusters are also perfectly speaker-dependent, then only one of these clusters would be mapped to the true word type and the other would be counted as errors. Despite the adverse effect on these metrics, it is of practical importance to evaluate a zero-resource system in the speaker-independent setting.

Model development and hyperparameters

Most model hyperparameters are set according to previous work (as referenced below). Any changes are based exclusively on performance on English1.

Training parameters for the cAE (Section 3.3) are based on . The model is pretrained as a standard autoencoder on all data (in a particular set) for 5 epochs using minibatch stochastic gradient descent with a batch size of 2048 and a fixed learning rate of $2\cdot 10^{-3}$ . Subsequent correspondence training is performed for 120 epochs using a learning rate of $32\cdot 10^{-3}$ . Each pair is presented in both directions as input and output. Pairs are extracted using the UTD system of : for English1, 14 494 word pairs are discovered; for English2, 10 769 pairs; and for Xitsonga, 6979. The cAE is trained on each of these sets separately. In all cases, the model consists of nine hidden layers of 100 units each, except for the eighth layer which is a bottleneck layer of 13 units. We use $\tanh$ as non-linearity. The position of the bottleneck layer is based on intrinsic evaluation on English1. Although it is common in NN speech systems to use nine or eleven sliding frames as input, we use single-frame cepstral mean and variance normalized MFCCs with first and second order derivatives (39-dimensional), as also done in . For feature extraction, the cAE is cut at the bottleneck layer, resulting in 13-dimensional output (chosen to match the dimensionality of the static MFCCs). For both the MFCC and cAE acoustic word embeddings, we downsample a segment to ten frames, resulting in 130-dimensional embeddings. As in , embeddings are normalized to have unit length.
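
The construction of a fixed-dimensional embedding from a variable-length segment can be sketched as below. The text does not state how frames are selected when downsampling, so picking ten uniformly spaced frames is an assumption here, and `downsample_embedding` is a hypothetical name.

```python
import math


def downsample_embedding(frames, n=10):
    """Turn a list of per-frame feature vectors into one unit-length vector.

    Assumption: the n frames are taken at uniformly spaced positions.
    """
    T = len(frames)
    idx = [min(T - 1, int((i + 0.5) * T / n)) for i in range(n)]
    flat = [value for i in idx for value in frames[i]]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


# A hypothetical 25-frame segment of 13-dimensional features.
segment = [[float(t + 1)] * 13 for t in range(25)]
embedding = downsample_embedding(segment)
print(len(embedding))
```

With 13-dimensional frames and ten retained frames, the result has 130 dimensions regardless of the duration of the segment.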

For the acoustic model (Section 3.1) we use the following hyperparameters, as in : all-zero vector for $\boldsymbol{\mathbf{\mu}}_{0}$ , $\sigma_{0}^{2}=\sigma^{2}/\kappa_{0}$ , $\kappa_{0}=0.05$ and $a=1$ . For MFCC embeddings we use $\sigma^{2}=1\cdot 10^{-3}$ for the fixed shared spherical covariance matrix, while for cAE embeddings we use $\sigma^{2}=1\cdot 10^{-4}$ . This was based on speaker-dependent English1 performance. We found that $\sigma^{2}$ is one of the parameters most sensitive to the input representation and often requires tuning; generally, however, it is robust if it is chosen small enough (in the ranges used here).

We use the oscillator-based syllabification system of Räsänen et al. without modification. Word candidates are limited to span a maximum of six syllables. One difficulty is to decide beforehand how many potential word clusters (the number of components $K$ in the acoustic model) we need. Here we follow the same approach as in : we choose $K$ as a proportion of the number of discovered syllable tokens. For the speaker-dependent settings, we set $K$ as $20\%$ of the number of syllables, based on English1 performance. On average, this amounts to $K=1549$ on English1, $K=1195$ on English2, and $K=298$ on Xitsonga. Compared to the average number of word types per speaker shown in Table 1, these numbers are higher for the English sets and slightly lower for Xitsonga. For speaker-independent models, we use $5\%$ of the syllable tokens, amounting to $K=4647$ on English1, $K=3584$ on English2, and $K=1789$ on Xitsonga. These are lower than the true number of total word types shown in Table 1. On English1, speaker-independent performance did not improve when using a larger $K$ and inference was much slower.

To improve sampler convergence, we use simulated annealing . We found that convergence is improved by first running the sampler in Algorithm 1 without sampling boundaries. In all experiments we do this for 15 iterations. Subsequently, the complete sampler is run for $J=15$ Gibbs sampling iterations with 3 annealing steps. Word boundaries are initialized randomly by setting boundaries at allowed locations with a $0.25$ probability.
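
The random initialization of word boundaries can be sketched as follows. This is a minimal sketch: it assumes that the allowed locations are the internal syllable boundaries of an utterance and that the utterance end is always a boundary; `init_boundaries` is a hypothetical name.

```python
import random


def init_boundaries(n_syllables, p=0.25, rng=None):
    """Return one flag per syllable: True if a word ends after that syllable."""
    rng = rng or random.Random()
    internal = [rng.random() < p for _ in range(n_syllables - 1)]
    return internal + [True]  # the utterance end always closes a word


print(init_boundaries(8, rng=random.Random(0)))
```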

Given the common setup above, we consider three variants of our approach:

BayesSeg is the most general segmental Bayesian model. In this model, a word segment can be of any duration, as long as it spans at most six syllables.

BayesSegMinDur is the same as BayesSeg, but requires word candidates to be at least 250 ms in duration; on English1, this improved performance on several metrics. Such a minimum duration constraint is also used in most UTD systems .

SyllableBayesClust clusters the discovered syllable tokens using the Bayesian GMM, but does not sample word boundaries. It can be seen as a baseline for the two models above, where segmentation is turned off and the detected syllable boundaries are set as initial (and permanent) word boundaries. All word candidates therefore span a single syllable in this model.
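
The three variants differ only in which word candidates they allow, which can be sketched as below. This is an illustration under assumptions: syllable boundaries are given as times in seconds (including the utterance start and end), and `word_candidates` is a hypothetical helper rather than part of the actual system.

```python
def word_candidates(boundaries, max_syllables=6, min_duration=0.0):
    """List (start, end) times of all allowed word candidates in one utterance."""
    candidates = []
    last = len(boundaries) - 1
    for start in range(last):
        for end in range(start + 1, min(start + max_syllables, last) + 1):
            if boundaries[end] - boundaries[start] >= min_duration:
                candidates.append((boundaries[start], boundaries[end]))
    return candidates


boundaries = [0.0, 0.1, 0.3, 0.6]  # three detected syllables (toy values)
bayes_seg = word_candidates(boundaries)                      # BayesSeg
min_dur = word_candidates(boundaries, min_duration=0.25)     # BayesSegMinDur
syllable_only = word_candidates(boundaries, max_syllables=1)  # SyllableBayesClust
print(len(bayes_seg), len(min_dur), len(syllable_only))
```

For this toy utterance the minimum duration constraint removes the two candidates shorter than 250 ms, and the syllable-only setting leaves one candidate per syllable.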

Results: Word error rates and analysis

Table 2 shows one-to-one and many-to-one WERs for the different speaker-dependent models on the three datasets. The trends in WER using one-to-one and many-to-one mappings are similar, with the latter consistently better by around 10% to 20% absolute. The performance on Xitsonga varies much more dramatically than on the English datasets, with WER ranging from around 140% to 75% and WER ${}_{\text{m}}$ from 135% to 69%. (From its definition, WER is more than 100% if there are more substitutions, deletions and insertions than ground truth tokens.) Table 1 shows that the characteristics of the Xitsonga data are quite different from the English sets. For the speaker-dependent case here, much less data is available per Xitsonga speaker (just over six minutes on average) than for an English speaker (more than ten minutes), which might (at least partially) explain why error rates vary much more dramatically on Xitsonga. Moreover, there is a much higher proportion of multisyllabic words in Xitsonga, as reflected in the average duration of words, which is almost twice as long in the Xitsonga data as in the English data (Section 4.1).
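
To see how WER can exceed 100%, consider the sketch below, which computes WER as the standard token-level edit distance divided by the number of reference tokens. The token labels are hypothetical; the example mimics over-segmentation, where two true words are returned as five syllable-sized tokens.

```python
def word_error_rate(ref, hyp):
    """WER in percent: (substitutions + deletions + insertions) / len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


ref = ["w1", "w2"]                    # two multisyllabic true words
hyp = ["w1", "c7", "c3", "c9", "c2"]  # five predicted tokens
print(word_error_rate(ref, hyp))
```

Here one substitution and three insertions are needed against only two reference tokens, so the error count is larger than the number of ground truth tokens.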

Comparing the results for the three systems using MFCC features indicates that, on all three datasets, allowing the system to infer word boundaries across multiple syllables (BayesSeg) yields better performance than treating each syllable as a word candidate (SyllableBayesClust). Incorporating a minimum duration constraint (BayesSegMinDur) improves performance further. The relative differences between these systems are much more pronounced in Xitsonga, presumably due to the higher proportion of multisyllabic words. Despite the high error rates, this analysis nevertheless shows the benefits of top-down segmentation and minimum duration constraints; using bootstrap confidence interval estimation, these improvements of BayesSeg over SyllableBayesClust and of BayesSegMinDur over BayesSeg were found to be statistically significant at the 99.9% level for all three datasets and for both the WER and WER ${}_{\text{m}}$ metrics. (In the bootstrap procedure, $B=1000$ bootstrap samples of a dataset are generated by sampling with replacement at the utterance level. For a single system, WER can then be calculated for each of these samples in order to estimate the spread of the WER around its mean. To compare two systems, the difference in WER is calculated when evaluating both systems on each of the samples, giving an estimate of the probability of improvement of one system over another. See for complete details.)
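
The paired bootstrap comparison can be sketched as follows. This is a minimal sketch, assuming that per-utterance error counts and reference token counts are already available for both systems; the function name and its arguments are hypothetical, and WER is pooled over the resampled utterances.

```python
import random


def bootstrap_prob_improvement(errors_a, errors_b, n_ref, B=1000, seed=0):
    """Estimate the probability that system B has a lower WER than system A."""
    rng = random.Random(seed)
    n = len(n_ref)
    wins = 0
    for _ in range(B):
        # Resample utterances with replacement; both systems see the same sample.
        idx = [rng.randrange(n) for _ in range(n)]
        ref_tokens = sum(n_ref[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / ref_tokens
        wer_b = sum(errors_b[i] for i in idx) / ref_tokens
        wins += wer_b < wer_a
    return wins / B


# Hypothetical per-utterance counts for four utterances.
errors_a = [5, 6, 7, 5]
errors_b = [3, 4, 5, 3]
n_ref = [10, 10, 10, 10]
print(bootstrap_prob_improvement(errors_a, errors_b, n_ref))
```

Evaluating both systems on the same resampled utterances is what makes the comparison paired: differences due to easy or hard utterances cancel out.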

Table 2 also shows that in most cases the cAE features perform similarly to MFCC features in these speaker-dependent systems, although there is a large improvement in Xitsonga for the BayesSeg system when switching to cAE features (from 116.2% to 107.9% in WER and from 109.5% to 100.5% in WER ${}_{\text{m}}$ , again significant at the 99.9% level).

To get a better insight into the types of errors that the models make, Tables 3 and 4 give a breakdown of word boundary detection scores, individual error rates, and average cluster purity on English2 and Xitsonga, respectively. Bootstrap estimates of two standard deviations around each WER are also given, indicating the range in which the true WER lies with 95% probability. A word boundary tolerance of 20 ms is used, with a greedy one-to-one mapping for calculating error rates. SyllableBayesClust gives an upper bound for word boundary recall since every syllable boundary is set as a word boundary. The low recall (28.9% and 24.8%) could potentially be improved by using a better syllabification method, but we leave such an investigation for future work.

Table 3 shows that on English2, the MFCC-based BayesSeg and BayesSegMinDur models under-segment compared to SyllableBayesClust, causing systematically poorer word boundary recall and $F$ -scores and an increase in deletion errors. However, this is accompanied by large reductions in substitution and insertion error rates, resulting in overall WER improvements and more accurate clusters when boundaries are inferred (45.1% purity, BayesSeg-MFCC) rather than using fixed syllable boundaries (42%, SyllableBayesClust), with further improvements when not allowing short word candidates (56%, BayesSegMinDur-MFCC).

In contrast to English2, Table 4 shows that on Xitsonga, SyllableBayesClust heavily over-segments, causing a large number of insertion errors. This is not surprising since every syllable is treated as a word, while most of the true Xitsonga words are multisyllabic. At the cost of more deletions and poorer word boundary detection, BayesSeg-MFCC and BayesSegMinDur-MFCC systematically reduce substitution and insertion errors, again resulting in better overall WER and average cluster purity. Where the cAE-based models on English2 performed more-or-less on par with their MFCC counterparts, on Xitsonga the cAE embeddings yield large improvements on some metrics: by switching to cAE embeddings, the WER of BayesSeg improves by 8.3% absolute, while average cluster purity is 13.6% better for BayesSegMinDur.

Speaker-independent models

Table 5 gives the performance of different speaker-independent models. Compared to the speaker-dependent results of Table 2, performance is worse for all models and datasets. Dealing with multiple speakers is clearly challenging for these unsupervised systems. Nevertheless, the analysis still allows us to compare the different variants of our approach. As in the speaker-dependent case, BayesSegMinDur is the best performing MFCC system, followed by BayesSeg, and SyllableBayesClust performs worst; again these differences are significant at the 99.9% level. In the speaker-dependent experiments, some MFCC-based models slightly outperformed their cAE counterparts. Here, however, the WERs of cAE models are identical or improved in all cases; for Xitsonga in particular, improvements are obtained by using cAE features in both BayesSeg (improvement of 26.3% absolute in WER) and BayesSegMinDur (7.4%). The cAE-based BayesSegMinDur model is the only speaker-independent Xitsonga model with a WER less than 100%. Again, by allowing more than one cluster to be mapped to the same true word type, WER ${}_{\text{m}}$ scores are lower than WER. On English, the cAE-based models do not yield better WER ${}_{\text{m}}$ than their MFCC counterparts, probably because WER ${}_{\text{m}}$ does not penalize for creating separate speaker- or gender-specific clusters (these would just get mapped to the same word for scoring). Nevertheless, the cAE features still yield large improvements in Xitsonga. Word boundary scores and substitution, deletion and insertion errors (not shown) follow a similar pattern to that of the speaker-dependent models. Bootstrap estimates of the spread around the individual WERs were of the same order as those in Tables 3 and 4; the SyllableBayesClust Xitsonga system has the biggest spread with the true WER lying in $167.2\pm 1.6\%$ with 95% probability.

To better illustrate the benefits of unsupervised representation learning, Table 6 shows general purity measures for the speaker-independent MFCC- and cAE-based BayesSegMinDur models. Average cluster purity is as defined before. Average speaker purity is similarly defined, but instead of considering the mapped ground truth label of a segmented token, it considers the speaker who produced it: speaker purity is 100% if every cluster contains tokens from a single speaker, while it is $1/12=8.3\%$ if all clusters are completely speaker balanced for the English sets and $1/24=4.2\%$ for Xitsonga. Average gender purity is similarly defined: it is 100% if every cluster contains tokens from a single gender, while $1/2=50\%$ indicates a perfectly gender-balanced cluster. Ideally, a speaker-independent system should have high cluster purity and low speaker and gender purities. Table 6 indicates that for all three datasets, cAE-based embeddings are less speaker and gender discriminative, and have higher or similar cluster purity compared to the MFCC-based embeddings.
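
The purity measures can be sketched with a single function applied to different token attributes. This sketch assumes the standard definition of purity (the fraction of tokens that carry the majority attribute of their cluster), which is consistent with the limiting values quoted above; the data are hypothetical.

```python
from collections import Counter


def purity(cluster_ids, attributes):
    """Fraction of tokens carrying the majority attribute of their cluster."""
    per_cluster = {}
    for cluster, attribute in zip(cluster_ids, attributes):
        per_cluster.setdefault(cluster, Counter())[attribute] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(cluster_ids)


# Two clusters, each holding exactly one token from each of 12 speakers.
cluster_ids = [c for c in range(2) for _ in range(12)]
speakers = [s for _ in range(2) for s in range(12)]
genders = ["f" if s % 2 == 0 else "m" for s in speakers]
print(purity(cluster_ids, speakers))
print(purity(cluster_ids, genders))
```

Passing mapped ground truth labels gives cluster purity, while passing speaker or gender labels gives the speaker and gender purities; perfectly balanced clusters yield $1/12$ and $1/2$ in this toy case.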

Qualitative analysis and summary

Qualitative analysis involved concatenating and listening to the audio from the tokens in some of the biggest clusters of the best speaker-dependent and -independent models. Apart from the trends mentioned already, others also became immediately apparent. Despite the low average cluster purity ranging from 30% to 60% in the analyses above, we found that most of the clusters are acoustically very pure: often tokens correspond to the same syllable or partial word, but occur within different ground truth words. For example, a cluster with the word ‘day’ had the corresponding portions from ‘daycare’ and ‘Tuesday’. These are marked as errors for cluster purity and WER calculations. In the next section, we use NED as metric, which does not penalize such partial word matches. The biggest clusters often correspond to filler-words. As an example, speaker S38 from English1 had several clusters corresponding to ‘yeah’ and ‘you know’. But the BayesSegMinDur-MFCC model applied to S38 also discovered pure clusters corresponding to ‘different’, ‘people’ and ‘five’. For the speaker-independent BayesSegMinDur-cAE system, the biggest clusters consisted of instances of ‘um’, ‘uh’, ‘oh’, ‘so’ and ‘yeah’.

In summary, the high error rates reported above indicate that significant effort is still required in order to achieve reasonable performance with such zero-resource methods. A comparison of Tables 2 and 5 shows that dealing with multiple speakers is particularly challenging—recent zero-resource work has started to investigate this aspect specifically . Nevertheless, the above analysis allowed us to compare and draw conclusions regarding the different variants of our approach. Specifically, although under-segmentation occurs in the BayesSeg and BayesSegMinDur models, these models yield more accurate clusters and thereby improve overall purity and WER. In most cases, cAE embeddings either yield similar or improved performance compared to MFCCs. In particular in the speaker-independent case, cAE-based models discover clusters that are more speaker- and gender-independent. This illustrates the benefit of incorporating weak top-down supervision for unsupervised representation learning within a zero-resource system.

Results: Comparison to other systems

We now compare our approach to others using the evaluation framework provided as part of the ZRS challenge . We compare our approach to three systems:

ZRSBaselineUTD is the UTD system used as official baseline in the challenge (see Section 2.2).

UTDGraphCC is the best UTD system of , employing a connected component graph clustering algorithm to group discovered segments (also Section 2.2).

SyllableSegOsc ${}^{\text{+}}$ is the speaker-dependent, full-coverage syllable-based system of Räsänen et al., which predicts word candidates bottom-up from reoccurring sequences of syllable clusters.

For our approach, we focus on systems that performed best on English1 in the previous section: for the speaker-dependent setting we use the MFCC-based BayesSegMinDur system, while for the speaker-independent setting we use the cAE-based BayesSegMinDur model. The performance of all our system variants on all of the ZRS metrics is given in Appendix C.

Figure 4 shows the NED scores of the different systems on English2 and Xitsonga. ZRSBaselineUTD yields the best NED on both languages, with UTDGraphCC also performing well. UTD systems like these explicitly aim to discover high-precision clusters of isolated segments, but do not cover all the data. They are therefore tailored to NED, which only evaluates the patterns discovered by the method and does not evaluate recall on the rest of the data. In contrast, SyllableSegOsc ${}^{\text{+}}$ and our own systems perform full-coverage segmentation. Of these, our systems achieve better NED than SyllableSegOsc ${}^{\text{+}}$ on both languages, indicating that the discovered clusters in our approach are more consistent. Even when running our system in a speaker-independent setting (BayesSegMinDur-cAE in the figure), our approach outperforms the speaker-dependent SyllableSegOsc ${}^{\text{+}}$ .

Figures 5 and 6 show the token, type and boundary $F$ -scores on the two languages. For comparison, word token $F$ -scores of less than $4\%$ were achieved at the 2012 JHU CSLP workshop, although a different dataset was used . Apart from word type $F$ -score on Xitsonga, our models outperform all other approaches in the direct comparison here. The UTD systems struggle on these metrics since the $F$ -scores are based on precision and recall over the entire input. The full-coverage SyllableSegOsc ${}^{\text{+}}$ is therefore our strongest competitor in most cases. The prediction of word candidates from reoccurring cluster sequences in SyllableSegOsc ${}^{\text{+}}$ is done greedily and bottom-up, without regard to other word mappings in an utterance. In contrast, BayesSegMinDur samples word boundaries and cluster assignments together by taking a whole utterance into account; it imposes a consistent top-down segmentation, while simultaneously adhering to bottom-up syllable boundary detection and minimum duration constraints. The result is a more accurate segmentation of the data. Note that in BayesSeg it is easy to incorporate additional bottom-up constraints (such as a minimum duration) and these are considered jointly with segmentation. In contrast, such a minimum duration constraint would require additional heuristics in the pure bottom-up approach of .

The results in Figures 5 and 6 also indicate that our speaker-independent system performs on par with the speaker-dependent system on these metrics; despite less accurate clusters (in terms of purity, WER and NED), the speaker-independent model still yields an accurate segmentation of the data, outperforming both speaker-independent UTD baselines and the speaker-dependent SyllableSegOsc ${}^{\text{+}}$ .

We conclude that by hypothesizing word boundaries consistently over an utterance rather than taking these decisions in isolation, our approach yields more accurate clusters (NED) that correspond better to true words (word type $F$ -score) than the full-coverage syllable-based approach of . It also segments the data more accurately (word token and boundary $F$ -scores), even when applying the model to data from multiple speakers. However, despite the benefits of our model, the algorithm of is much simpler in terms of computational complexity and implementation. Compared to UTD systems which aim to find high-quality reoccurring patterns but do not cover all the data, the items in our clusters have a poorer match to each other (NED), but correspond better to true words on the English data (word type $F$ -score). On both languages, our full-coverage method also segments the data better into word-like units (word boundary and token $F$ -scores) than the UTD systems.

Conclusion

We presented a segmental Bayesian model which segments and clusters conversational speech audio—a first attempt to evaluate a full-coverage zero-resource system on multi-speaker large-vocabulary data. The system limits word boundary positions by using a bottom-up presegmentation method to detect syllable-like units, and relies on a segmental approach where word segments are represented as fixed-dimensional acoustic word embeddings.

Our speaker-dependent system achieves WERs of around 84% on English and 76% on Xitsonga data, outperforming a purely bottom-up method that treats each syllable as a word candidate. Despite much worse speaker-independent performance, here we achieve improvements by incorporating frame-level features from an autoencoder-like neural network trained using weak top-down constraints. This results in clusters that are purer and less speaker- and gender-specific than when using MFCCs, showing for the first time the benefit of unsupervised representation learning within a complete zero-resource system.

We compared our approach to state-of-the-art baselines on both languages. We found that, although the isolated patterns discovered by UTD are more consistent, the clusters of our full-coverage approach are better matched to true words, measured in terms of word token, type and boundary $F$ -scores. We also found that by proposing a consistent segmentation and clustering over whole utterances, our approach outperforms a purely bottom-up syllable-based full-coverage system on these metrics.

The high WERs reported in this study show that there is still much work to be done in the area of zero-resource speech processing. Nevertheless, previous work shows that high-error rate unsupervised systems can still be useful in downstream tasks. The analysis presented here also provides useful baselines and guidance for future work. In particular, we show the benefits of performing consistent top-down segmentation while adhering to bottom-up constraints, as well as incorporating unsupervised representation learning. Our own future work will consider better acoustic word embedding approaches, improving the recall of the syllabic presegmentation method, and improving the overall efficiency of the model.

Acknowledgements

We would like to thank Okko Räsänen and Shreyas Seshadri for providing the code for their syllable boundary detection algorithm and for regenerating their ZRS results. We also thank Roland Thiollière and Maarten Versteegh for providing us with the alignments used in the ZRS challenge. HK was funded by a Commonwealth Scholarship. This work was supported in part by a James S. McDonnell Foundation Scholar Award to SG.

References

Appendices

Appendix A Posterior predictive of spherical Gaussian

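Because of the conjugate priors with known spherical covariance matrices, the probability density function (PDF) of the multivariate posterior predictive $p(\boldsymbol{\mathbf{x}}_{i}|\mathcal{X}_{k\backslash i};\boldsymbol{\mathbf{\beta}})$ in (5) is itself a spherical covariance Gaussian. This PDF decomposes into the product of univariate PDFs; for a single dimension $x_{i}$ of vector $\boldsymbol{\mathbf{x}}_{i}$, the univariate PDF is given by

$$p(x_{i}|\mathcal{X}_{k\backslash i})=\mathcal{N}(x_{i}|\mu_{N_{k\backslash i}},\sigma^{2}_{N_{k\backslash i}}+\sigma^{2})\tag{6}$$

where

$$\sigma^{2}_{N_{k\backslash i}}=\frac{\sigma^{2}\sigma_{0}^{2}}{N_{k\backslash i}\sigma_{0}^{2}+\sigma^{2}}\tag{7}$$

$$\mu_{N_{k\backslash i}}=\sigma^{2}_{N_{k\backslash i}}\left(\frac{\mu_{0}}{\sigma_{0}^{2}}+\frac{N_{k\backslash i}\,\overline{x}_{k\backslash i}}{\sigma^{2}}\right)\tag{8}$$

with $\mu_{0}$ the corresponding dimension of $\boldsymbol{\mathbf{\mu}}_{0}$, $N_{k\backslash i}$ the number of embeddings assigned to component $k$ without counting $\boldsymbol{\mathbf{x}}_{i}$, and $\overline{x}_{k\backslash i}$ is component $k$ ’s sample mean for this dimension.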
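
For a single dimension, the posterior predictive can be evaluated as in the sketch below. It uses the standard result for a Gaussian likelihood with known variance $\sigma^{2}$ and a Gaussian prior with mean $\mu_{0}$ and variance $\sigma_{0}^{2}$ on the component mean; the function name and argument names are hypothetical, and the log density is returned for numerical convenience.

```python
import math


def posterior_predictive_logpdf(x, n, xbar, mu0, sigma2, sigma2_0):
    """Log posterior predictive density of x for one dimension of one component.

    n and xbar are the count and sample mean of the values already assigned
    to the component (excluding x itself).
    """
    sigma2_n = sigma2 * sigma2_0 / (n * sigma2_0 + sigma2)  # posterior variance of the mean
    mu_n = sigma2_n * (mu0 / sigma2_0 + n * xbar / sigma2)  # posterior mean
    var = sigma2_n + sigma2                                 # predictive variance
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu_n) ** 2 / var)


# Hyperparameters as quoted for MFCC embeddings: sigma^2 = 1e-3, kappa_0 = 0.05.
sigma2 = 1e-3
sigma2_0 = sigma2 / 0.05
print(posterior_predictive_logpdf(0.5, 3, 0.5, 0.0, sigma2, sigma2_0))
```

For an empty component ($n=0$) the expression reduces to the prior predictive with mean $\mu_{0}$ and variance $\sigma_{0}^{2}+\sigma^{2}$; the full multivariate log density is the sum of this quantity over dimensions.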

Appendix B Forward filtering backward sampling for word segmentation

To sample the new set of embeddings in line 5 of Algorithm 1, the forward filtering backward sampling dynamic programming algorithm is used. Forward variable $\alpha[t]$ is defined as the density of the frame sequence $\boldsymbol{\mathbf{y}}_{1:t}$, with the last frame the end of a word: $\alpha[t]\triangleq p(\boldsymbol{\mathbf{y}}_{1:t}|{h^{-}})$. The embeddings and component assignments for all words not in the current utterance $\boldsymbol{\mathbf{s}}_{i}$, and the hyperparameters of the GMM, are denoted as $h^{-}=(\mathcal{X}_{\backslash s},\boldsymbol{\mathbf{z}}_{\backslash s};a,\boldsymbol{\mathbf{\beta}})$. The forward variables can be recursively calculated as:

$$\alpha[t]=\sum_{j=1}^{t}p(\boldsymbol{\mathbf{y}}_{{t-j+1}:t}|h^{-})\,\alpha[t-j]\tag{9}$$

starting with $\alpha[0]=1$ and calculating (9) for $1\leq t\leq M-1$. The $p(\boldsymbol{\mathbf{y}}_{{t-j+1}:t}|h^{-})$ term in (9) is the value of a joint probability density function (PDF) over acoustic frames $\boldsymbol{\mathbf{y}}_{{t-j+1}:t}$. In analogy to a frame-based supervised model where this term would be calculated as the product of the PDF values of a GMM for all the frames involved, we define this term as

$$p(\boldsymbol{\mathbf{y}}_{{t-j+1}:t}|h^{-})\triangleq\left[p(\boldsymbol{\mathbf{x}}^{\prime}|h^{-})\right]^{j}\tag{10}$$

where $\boldsymbol{\mathbf{x}}^{\prime}=f_{e}(\boldsymbol{\mathbf{y}}_{{t-j+1}:t})$ is the acoustic word embedding calculated on the segment. Thus, as in the frame-based supervised case, each frame is assigned a PDF score; but in this case, all $j$ frames in the segment are assigned the PDF value of the whole segment under the current acoustic model. The required marginal term in (10) can be calculated as:

$$p(\boldsymbol{\mathbf{x}}^{\prime}|h^{-})=\sum_{k=1}^{K}P(z_{h}=k|\boldsymbol{\mathbf{z}}_{\backslash s};a)\,p(\boldsymbol{\mathbf{x}}^{\prime}|\mathcal{X}_{k\backslash s};\boldsymbol{\mathbf{\beta}})\tag{11}$$

where $z_{h}$ denotes the component assignment of $\boldsymbol{\mathbf{x}}^{\prime}$, with the two terms in the summation calculated in the same way as those in (5).

Once all $\alpha$ ’s have been calculated, a segmentation can be sampled backwards. Starting from the final position $t=M$, we sample the preceding word boundary position using:

$$P(q_{t}=j|\boldsymbol{\mathbf{y}}_{1:t},h^{-})\propto p(\boldsymbol{\mathbf{y}}_{{t-j+1}:t}|h^{-})\,\alpha[t-j]\tag{12}$$

Variable $q_{t}$ is the number of frames that we need to move backwards from position $t$ to find the preceding word boundary. We calculate (12) for $1\leq j\leq t$ and sample while $t-j\geq 1$.
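
The procedure in this appendix can be sketched as follows. This is an illustrative sketch and not the implementation used in the experiments: `seg_logpdf(start, end)` is a hypothetical callback returning the log of the segment score for frames `start+1` to `end` (that is, $j$ times the log marginal of the segment embedding), computation is done in the log domain, and the restriction of boundaries to detected syllable boundaries is omitted for brevity.

```python
import math
import random


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sample_segmentation(M, seg_logpdf, rng=None):
    """Sample a segmentation of frames 1..M as a list of (start, end) spans."""
    rng = rng or random.Random()
    # Forward filtering: log alpha[0] = 0, then alpha[t] for 1 <= t <= M - 1.
    log_alpha = [0.0]
    for t in range(1, M):
        log_alpha.append(logsumexp(
            [seg_logpdf(t - j, t) + log_alpha[t - j] for j in range(1, t + 1)]))
    # Backward sampling: start at t = M and repeatedly sample q_t = j.
    segments, t = [], M
    while t > 0:
        lengths = list(range(1, t + 1))
        log_w = [seg_logpdf(t - j, t) + log_alpha[t - j] for j in lengths]
        norm = logsumexp(log_w)
        j = rng.choices(lengths, weights=[math.exp(w - norm) for w in log_w])[0]
        segments.append((t - j, t))
        t -= j
    return segments[::-1]


def two_frame_score(start, end):
    # Toy score that strongly prefers segments of exactly two frames.
    return 0.0 if end - start == 2 else -1000.0


print(sample_segmentation(6, two_frame_score, random.Random(0)))
```

With the toy score, all other segment lengths receive negligible weight, so the sampled segmentation consists of two-frame words. In a full system the callback would embed the segment and evaluate the marginal under the current acoustic model.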

Appendix C Tables of complete results for all systems and metrics

In Section 4.4, several variants of our approach were considered. In Section 4.5, a subset of these were compared to other systems evaluated in the context of the Zero Resource Speech Challenge 2015 (ZRS) , using a subset of the challenge metrics. Tables 7 and 8 give the performance of all variants of our system on all the ZRS metrics on the English and Xitsonga data, respectively.