Single-Channel Multi-Speaker Separation using Deep Clustering

Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey

Introduction

The human auditory system gives us the extraordinary ability to converse in the midst of a noisy throng of party goers. Solving this so-called cocktail party problem has proven extremely challenging for computers, and separating and recognizing speech in such conditions has been the holy grail of speech processing for more than 50 years. Previously, no practical method existed that could solve the problem in general conditions, especially in the case of single channel speech mixtures.

This work builds upon recent advances in single-channel separation, using a method known as deep clustering . In deep clustering, a neural network is trained to assign an embedding vector to each element of a multi-dimensional signal, such that clustering the embeddings yields a desired segmentation of the signal. In the cocktail-party problem, the embeddings are assigned to each time-frequency (TF) index of the short-time Fourier transform (STFT) of the mixture of speech signals. Clustering these embeddings yields an assignment of each TF bin to one of the inferred sources. These assignments are used as a masking function to extract the dominant parts of each source. Preliminary work on this method produced remarkable performance, improving SNR by 6 dB on the task of separating two unknown speakers from a single-channel mixture .

In this paper we present improvements and extensions that enable a leap forward in separation quality, reaching levels of improvement that were previously out of reach (audio examples and scripts to generate the data used here are available at ). In addition to improvements to the training procedure, we investigate the three speaker case, showing generalization between two- and three-speaker networks.

The original deep clustering system was intended to only recover a binary masks for each source, leaving recovery of the missing features to subsequent stages. In this paper, we incorporate enhancement layers to refine the signal estimate. Using soft clustering, we can then train the entire system end-to-end, training jointly through the deep clustering embeddings, the clustering and enhancement stages. This allows us to directly use a signal approximation objective instead of the original mask-based deep clustering objective.

Prior work in this area includes auditory grouping approaches to computational auditory scene analysis (CASA) . These methods used hand-designed features to cluster the parts of the spectrum belonging to the same source. Their success was limited by the lack of a machine learning framework. Such a framework was provided in subsequent work on spectral clustering , at the cost of prohibitive complexity.

Generative models have also been proposed, beginning with . In constrained tasks, super-human speech separation was first achieved using factorial HMMs and was extended to speaker-independent separation . Variants of non-negative matrix factorization and Bayesian non-parametric models have also been used. These methods suffer from computational complexity and difficulty of discriminative training.

In contrast to the above, deep learning approaches have recently provided fast and accurate methods on simpler enhancement tasks . These methods treat the mask inferance as a classification problem, and hence can be discriminatively trained for accuracy without sacrificing speed. However they fail to learn in the speaker-independent case, where sources are of the same class , despite the work-around of choosing the best permutation of the network outputs during training. We call this the permutation problem: there are multiple valid output masks that differ only by a permutation of the order of the sources, so a global decision is needed to choose a permutation.

Deep clustering solves the permutation problem by framing mask estimation as a clustering problem. To do so, it produces an embedding for each time-frequency element in the spectrogram, such that clustering the embeddings produces the desired segmentation. The representation is thus independent of permutation of the source labels. It can also flexibly represent any number of sources, allowing the number of inferred sources to be decided at test time. Below we present the deep clustering model and further investigate its capabilities. We then present extensions to allow end-to-end training for signal fidelity.

The results are evaluated using an automatic speech recognition model trained on clean speech. The end-to-end signal approximation produces unprecedented performance, reducing the word error rate (WER) from close to 89.1% WER down to 30.8% by using the end-to-end training. This represents a major advancement towards solving the cocktail party problem.

Deep Clustering Model

Here we review the deep clustering formalism presented in . We define as $x$ a raw input signal and as $X_{i}=g_{i}(x),i\in\{1,\dots,N\},$ a feature vector indexed by an element $i$ . In audio signals, $i$ is typically a TF index $(t,f)$ , where $t$ indexes frame of the signal, $f$ indexes frequency, and $X_{i}=X_{t,f}$ the value of the complex spectrogram at the corresponding TF bin. We assume that the TF bins can be partitioned into sets of TF bins in which each source dominates. Once estimated, the partition for each source serves as a TF mask to be applied to $X_{i}$ , yielding the TF components of each source that are uncorrupted by other sources. The STFT can then be inverted to obtain estimates of each isolated source. The target partition in a given mixture is represented by the indicator $Y=\{y_{i,c}\}$ , mapping each element $i$ to each of $C$ components of the mixture, so that $y_{i,c}=1$ if element $i$ is in cluster $c$ . Then $A=YY^{T}$ is a binary affinity matrix that represents the cluster assignments in a permutation-independent way: $A_{i,j}=1$ if $i$ and $j$ belong to the same cluster and $A_{i,j}=0$ otherwise, and $(YP)(YP)^{T}=YY^{T}$ for any permutation matrix $P$ .

Improvements to the Training Recipe

We investigated several approaches to improve performance over the baseline deep clustering method, including regularization such as drop-out, model size and shape, and training schedule. We used the same feature extraction procedure as in , with log-magnitude STFT features as input, and we performed global mean-variance normalization as a pre-processing step. For all experiments we used we used rmsprop optimization with a fixed learning rate schedule, and early stopping based on cross-validation.

Regularizing recurrent network units: Recurrent neural network (RNN) units, in particular LSTM structures, have been widely adopted in many tasks such as object detection, natural language processing, machine translation, and speech recognition. Here we experiment with regularizing them using dropout.

LSTM nodes consist of a recurrent memory cell surrounded by gates controlling its input, output, and recurrent connections. The direct recurrent connections are element-wise and linear with weight 1, so that with the right setting of the gates, the memory is perpetuated, and otherwise more general recurrent processing is obtained.

Dropout is a training regularization in which nodes are randomly set to zero. In recurrent network there is a concern that dropout could interfere with LSTM’s memorization ability; for example, used it only on feed-forward connections, but not on the recurrent ones. Recurrent dropout samples the set of dropout nodes once for each sequence, and applies dropout to the same nodes at every time step for that sequence. Applying recurrent dropout to the LSTM memory cells recently yielded performance improvements on phoneme and speech recognition tasks with BLSTM acoustic models .

In this work, we sampled the dropout masks once at each time step for the forward connections, and only once for each sequence for the recurrent connections. We used the same recurrent dropout mask for each gate.

Architecture: We investigated using deeper and wider architectures. The neural network model used in was a two layer bidirectional long short-term memory (BLSTM) network followed by a feed-forward layer to produce embeddings. We show that expanding the network size improves performance for our task.

Temporal context: During training, the utterances are divided into fixed length non-overlapping segments, and gradients are computed using shuffled mini-batches of these segments, as in . Shorter segments increase the diversity within each batch, and may make an easier starting point for training, since the speech does not change as much over the segment. However, at test time, the network and clustering are given the entire utterance, so that the permutation problem can be solved globally. So we may also expect that training on longer segments would improve performance in the end.

In experiments below, we investigate training segment lengths of 100 versus 400, and show that although the longer segments work better, pretraining with shorter segments followed by training with longer segments leads to better performance on this task. This is an example of curriculum learning , in which starting with an easier task improves learning and generalization.

Multi-speaker training: Previous experiments showed preliminary results on generalization from two speaker training to a three-speaker separation task. Here we further investigate generalization from three-speaker training to two-speaker separation, as well as multi-style training on both two and three-speaker mixtures, and show that the multi-style training can achieve the best performance on both tasks.

Optimizing Signal Reconstruction

End-to-End Training

In order to consider end-to-end training in the sense of jointly training the deep clustering with the enhancement stage, we need to compute gradients of the clustering step. In , hard $K$ -means clustering was used to cluster the embeddings. The resulting binary masks cannot be directly optimized to improve signal fidelity, because the optimal masks are generally continuous, and because the hard clustering is not differentiable. Here we propose a soft $K$ -means algorithm that enables us to directly optimize the estimated speech for signal fidelity.

In , clustering was performed with equal weights on the TF embeddings, although weights were used in the training objective in order to train only on TF elements with significant energy. Here we introduce similar weights weights $w_{i}$ for each embedding $v_{i}$ to focus the clustering on TF elements with significant energy. The goal is mainly to avoid clustering silence regions, which may have noisy embeddings, and for which mask estimation errors are inconsequential.

The soft weighted $K$ -means algorithm can be interpreted as a weighted expectation maximization (EM) algorithm for a Gaussian mixture model with tied circular covariances. It alternates between computing the assignment of every embedding to each centroid, and updating the centroids:

where $\mu_{c}$ is the estimated mean of cluster $c$ , and $\gamma_{i,j}$ is the estimated assignment of embedding $i$ to the cluster $c$ . The parameter $\alpha$ controls the hardness of the clustering. As the value of $\alpha$ increases, the algorithm approaches $K$ -means.

The weights $w_{i}$ may be set in a variety of ways. A reasonable choice could be to set $w_{i}$ according to the power of the mixture in each TF bin. Here we set the weights to $1$ , except in silence TF bins where the weight is set to . Silence is defined using a threshold on the energy relative to the maximum of the mixture.

End-to-end training is performed by unfolding the steps of (2), and treating them as layers in a clustering network, according to the general framework known as deep unfolding . The gradients of each step are thus passed to the previous layers using standard back-propagation.

Experiments

Experimental setup: We evaluate deep clustering on a single-channel speaker-independent speech separation task, considering mixtures of two and three speakers with all gender combinations. For two-speaker experiments, we use the corpus introduced in , derived from the Wall Street Journal (WSJ0) corpus. It consists in a 30 h training set and a 10 h validation set with two-speaker mixtures generated by randomly selecting utterances by different speakers from the WSJ0 training set si_tr_s, and mixing them at various signal-to-noise ratios (SNR) randomly chosen between 0 dB and 10 dB. The validation set was here used to optimize some tuning parameters. The 5 h test set consists in mixtures similarly generated using utterances from 16 speakers from the WSJ0 development set si_dt_05 and evaluation set si_et_05. The speakers are different from those in our training and validation sets, leading to a speaker-independent separation task. For three-speaker experiments, we created a corpus similar to the two-speaker one, with the same amounts of data generated from the same datasets. All data were downsampled to 8 kHz before processing to reduce computational and memory costs. The input features $X$ were the log spectral magnitudes of the speech mixture, computed using a short-time Fourier transform (STFT) with a 32 ms sine window and 8 ms shift.

The initial system, based on , trains a deep clustering model on 100-frame segments from the two-speaker mixtures. The network, with 2 BLSTM layers, each having 300 forward and 300 backward LSTM cells, is denoted as $300\!\times\!2$ . The learning rate for the rmsprop algorithm was $\lambda=0.001\times(1/2)^{\lfloor\epsilon/50\rfloor}$ , where $\epsilon$ is the epoch number.

Regularization: We first considered improving performance of the baseline using common regularization practices. Table 2 shows the contribution of dropout ( $p=0.5$ ) on feed-forward connections, recurrent dropout ( $p=0.2$ ), and gradient normalization ( $|\nabla|\leq 200$ ), where the parameters were tuned on development data. Together these result in a 3.3 dB improvement in SDR relative to the baseline.

Architecture: Various network architectures were investigated by increasing the number of hidden units and number of BLSTM layers, as shown in Table 3. An improvement of 9.4 dB SDR was obtained with a deeper $300\times 4$ architecture, with 4 BLSTM layers and 300 units in each LSTM.

Pre-training of temporal context: Training the model with segments of 400 frames, after pre-training using 100-frame segments, boosts performance to 10.3 dB, as shown in Table 4, from $9.9$ dB without pre-training. Results for the remaining experiments are based on the pre-trained $300\!\times\!4$ model.

Multi-speaker training: We train the model further with a blend of two- and three-speaker mixtures. For comparison, we also trained a model using only three-speaker mixtures, again training first over 100-frame segments, then over 400-frame segments. The performance of the models trained on two-speaker mixtures only, on three-speaker mixtures only, and using the multi-speaker training, are shown in Table 5. The three-speaker mixture model seems to generalize better to two speakers than vice versa, whereas the multi-speaker trained model performed the best on both tasks.

Soft clustering: The choice of the clustering hardness parameter $\alpha$ and the weights on TF bins is analyzed on the validation set, with results in Table 6. The use of weights to ignore silence improves performance with diminishing returns for larg $\alpha$ . The best result is for $\alpha=5$ .

End-to-end training: Finally, we investigate end-to-end training, using a second-stage enhancement network on top of the deep clustering (‘dpcl’) model. Our enhancement network features two BLSTM layers with 300 units in each LSTM layer, with one instance per source followed by a soft-max layer to form a masking function. We first trained the enhancement network separately (‘dpcl + enh’), followed by end-to-end fine-tuning in combination with the dpcl model (‘end-to-end’). Table 7 shows the improvement in SDR as well as magnitude SNR (SNR computed on the magnitude spectrograms).

The magnitude SNR is insensitive to phase estimation errors introduced by using the noisy phases for reconstruction, whereas the SDR might get worse as a result of phase errors, even if the amplitudes are accurate. Speech recognition uses features based on the amplitudes, and hence the improvements in magnitude SNR seem to predict the improvements in WER due to the enhancement and end-to-end training. Fig. 1 shows that the SDR improvements of the end-to-end model are consistently good on nearly all of the two-speaker test mixtures.

ASR performance: We evaluated ASR performance (WER) with GMM-based clean-speech WSJ models obtained by a standard Kaldi recipe . The noisy baseline result on the mixtures is 89.1 %, while the result on the clean speech is 19.9 %. The raw output from dpcl did not work well, despite good perceptual quality, possibly due to the effect of near-zero values in the masked spectrum, which is known to degrade ASR performance. However, the enhancement networks significantly mitigated the degradation, and finally obtained 30.8 % with the end-to-end network.

Visualization: To gain insight into network functioning, we performed reverse correlation experiments. For each node, we average the 50-frame patches of input centered at the time when the node is active (e.g., the node is at 80% of its maximum value). Fig. 2 shows a variety of interesting patterns, which seem to reflect such properties as onsets, pitch, frequency chirps, and vowel-fricative transitions.