A Fully Convolutional Neural Network for Speech Enhancement
Se Rim Park, Jinwon Lee
Introduction
Denoising speech signals has been a long-standing problem. Decades of work have produced feasible solutions that estimate the noise model and use it to recover noise-deducted speech. Nonetheless, estimating the model for babble noise, which is encountered when a crowd of people is talking, is still a challenging task.
The presence of babble noise, however, greatly degrades the intelligibility of human speech. When babble noise dominates over speech, the aforementioned methods often fail to find the correct noise model. If so, the noise deduction distorts the speech, which creates discomfort for the users of hearing aids.
Here, instead of explicitly modeling the babble noise, we focus on learning a ‘mapping’ between noisy speech spectra and clean speech spectra, inspired by recent works on speech enhancement using neural networks. However, the model size of such neural networks easily exceeds several hundred megabytes, limiting their applicability to embedded systems.
On the other hand, Convolutional Neural Networks (CNN) typically consist of fewer parameters than feedforward and recurrent networks (FNNs and RNNs) due to their weight sharing property. CNNs have already proved their efficacy in extracting features for speech recognition and in eliminating noise in images. But to our knowledge, CNNs have not been tested in speech enhancement.
In this paper, we attempt to find a ‘memory efficient’ denoising algorithm for babble noise that creates minimal artifacts and that can be implemented in an embedded device: the hearing aid. Through experiments, we demonstrate that a CNN can perform better than Feedforward Neural Networks (FNN) or Recurrent Neural Networks (RNN) with a much smaller network size. A new network architecture, Redundant Convolutional Encoder Decoder (R-CED), is proposed, which extracts redundant representations of a noisy spectrum at the encoder and maps them back to a clean spectrum at the decoder. This can be viewed as mapping the spectrum to higher dimensions (e.g. kernel method) and projecting the features back to lower dimensions.
The paper is organized as follows. In section 2, a formal definition of the problem is stated. In section 3, the fully convolutional network architectures are presented, including the proposed R-CED network. In section 4, the experimental methods are provided. In section 5, the experiments and the corresponding network configurations are described. In section 6, the results are discussed, and in section 7, we conclude the study.
Problem Statement
Specifically, we formulate the mapping using a neural network (see Fig.1). If the network is of a recurrent type, the temporal behavior of the input spectra is already addressed by the network, and hence objective (1) suffices. On the other hand, for a convolutional type network, the past noisy spectra are considered together with the current one to denoise the current spectrum.
We use 8 consecutive input frames. Hence, the input spectra to the network are equivalent to about 100ms of speech segment, whereas the output spectrum of the network is of duration 32ms (see Fig.3).
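The objectives referred to in this section are not reproduced in this text. One formulation consistent with the description is sketched below as an assumption, not as the authors' exact notation: the symbols x_t (noisy spectrum at frame t), y_t (clean spectrum at frame t), f (the network) and n_T (the number of input frames) are introduced here for illustration only, and a squared l2 error is assumed as the loss.

```latex
% Assumed form of objective (1): frame-wise mapping, e.g. for a recurrent network
\min_{f} \sum_{t} \left\| y_t - f(x_t) \right\|_2^2

% Assumed form for a convolutional network: the current frame and the
% n_T - 1 preceding noisy frames are used to denoise the current frame
\min_{f} \sum_{t} \left\| y_t - f(x_{t-n_T+1}, \dots, x_t) \right\|_2^2
```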
Convolutional Network Architectures
The Convolutional Encoder-Decoder (CED) network proposed in prior work consists of symmetric encoding layers and decoding layers (see Fig.3, each block represents a feature). The encoder consists of repetitions of a convolution, batch-normalization, max-pooling, and a ReLU activation layer. The decoder consists of repetitions of a convolution, batch-normalization, and an up-sampling layer. Typically, CED compresses the features along the encoder, and then reconstructs the features along the decoder. In our problem, the original Softmax layer at the last layer is replaced with a convolution layer, to make CED a fully convolutional network.
Redundant CED Network (R-CED)
Here, we propose an alternative convolutional network architecture, namely the Redundant Convolutional Encoder-Decoder (R-CED) network. R-CED consists of repetitions of a convolution, batch-normalization, and a ReLU activation layer (see Fig.3, each block represents a feature). No pooling layer is present, and thus no upsampling layer is required. In contrast to CED, R-CED encodes the features into a higher dimension along the encoder and achieves compression along the decoder. The number of filters is kept symmetric: at the encoder, the number of filters is gradually increased, and at the decoder, the number of filters is gradually decreased. The last layer is a convolution layer, which makes R-CED a fully convolutional network.
Cascaded R-CED Network (CR-CED): The Cascaded Redundant Convolutional Encoder-Decoder Network (CR-CED) is a variation of the R-CED network. It consists of repetitions of R-CED networks. Compared to an R-CED of the same network size (i.e. the same number of parameters), CR-CED achieves better performance with less convergence time.
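To make the ‘expand, then compress’ structure concrete, the sketch below counts the parameters of a stack of 1-D convolution layers whose filter counts rise and then fall symmetrically. The filter counts and widths are hypothetical values chosen for illustration and are not the configurations from Table.1; biases are counted, batch-normalization parameters are ignored, and treating the 8 input frames as input channels is an assumption.

```python
def conv1d_params(in_channels, out_channels, width):
    """Weights plus biases of one 1-D convolution layer."""
    return in_channels * out_channels * width + out_channels


def count_params(input_channels, filters, widths):
    """Total parameters of a stack of 1-D convolution layers."""
    total = 0
    channels = input_channels
    for out_channels, width in zip(filters, widths):
        total += conv1d_params(channels, out_channels, width)
        channels = out_channels
    return total


# Hypothetical R-CED-like schedule: filter counts grow along the encoder,
# shrink along the decoder, and a final single-filter layer gives the output.
filters = [8, 16, 32, 16, 8, 1]
widths = [9, 7, 5, 7, 9, 9]
input_channels = 8  # assumption: the 8 input frames are treated as channels

total = count_params(input_channels, filters, widths)
print(total)
```

Because every layer shares its filter weights across all frequency bins, the count depends only on filter widths and filter counts, not on the 129-bin input length.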
Bypass Connections
For CED, R-CED, and CR-CED, bypass connections are added to the network to facilitate optimization in the training phase and improve performance. Of the two bypass schemes, skip connections and residual connections, we chose skip connections, which are more suitable for a symmetric encoder-decoder design. Bypass connections are illustrated in Fig.3 as an ‘addition’ operation symbol with an arrow. Bypass connections are added every other layer.
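The bypass scheme can be sketched in a few lines. The example below is a toy illustration, not the authors' implementation: each ‘layer’ is a plain Python function on a list of numbers, and the wiring rule (the output of an encoder layer is added to the output of its mirrored decoder layer, for every other encoder layer) is an assumption about how the symmetric skip connections are arranged.

```python
def add(a, b):
    """Element-wise addition of two equal-length feature vectors."""
    return [x + y for x, y in zip(a, b)]


def forward_with_skips(x, layers, skip_every=2):
    """Run a symmetric encoder-decoder stack with additive skip connections.

    layers: list of shape-preserving functions; the first half is the
    encoder, the second half the decoder. The output of encoder layer i is
    added to the output of decoder layer len(layers) - 1 - i whenever i is a
    multiple of skip_every (assumed wiring).
    """
    n = len(layers)
    saved = {}
    for i, layer in enumerate(layers):
        x = layer(x)
        if i < n // 2 and i % skip_every == 0:
            saved[n - 1 - i] = x       # keep the encoder feature for its mirror
        elif i in saved:
            x = add(x, saved.pop(i))   # the 'addition' symbol in the figure
    return x


# Four toy layers that each double their input.
double = lambda v: [2.0 * e for e in v]
layers = [double, double, double, double]
print(forward_with_skips([1.0], layers))
```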
1-Dim Convolution Operation for Convolution Layers
At all convolution layers throughout the paper, convolution was performed in only one direction (see Fig.5). In Fig.5, the input (white matrix) and the filter (blue matrix) have the same dimension along the time axis, and convolution is performed along the frequency axis. We found this more efficient than 2-dim convolution (see Fig.5) for our input spectra.
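The operation described above can be written out directly. The following is a minimal sketch in plain Python, assuming stride 1 and no padding; the function name and the toy values are hypothetical. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv_along_frequency(spectra, kernel):
    """Slide a (width x n_frames) kernel along the frequency axis only.

    spectra: list of n_bins rows, each row holding n_frames magnitudes.
    kernel:  list of width rows, each row holding n_frames weights.
    Returns n_bins - width + 1 values (no padding, stride 1).
    """
    n_bins, width = len(spectra), len(kernel)
    n_frames = len(kernel[0])
    assert all(len(row) == n_frames for row in spectra), "time axes must match"
    out = []
    for f in range(n_bins - width + 1):
        acc = 0.0
        for k in range(width):
            for t in range(n_frames):
                acc += spectra[f + k][t] * kernel[k][t]
        out.append(acc)
    return out


# Toy input: 5 frequency bins x 2 frames, kernel covering 3 bins x 2 frames.
spectra = [[1, 0], [2, 0], [3, 1], [4, 1], [5, 1]]
kernel = [[1, 1], [0, 0], [-1, -1]]
result = conv_along_frequency(spectra, kernel)
print(result)
```

Since the kernel spans the whole time axis, there is nothing to slide over in time, and each kernel position yields a single number per frequency offset.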
Experimental Methods
Dataset: The experiment was conducted on the TIMIT database, and 27 different types of noise clips were collected from a freely available online resource. The noise clips are mostly babble, but include other types of noise such as instrumental sounds. Each utterance in the training set (4620 utterances) and the testing set (200 utterances) was mixed with one of the 27 noise clips at 0dB SNR. After all feature transformation steps were completed, 20% of the training features were assigned to the validation set.
Feature Transformation: The audio signals were down-sampled to 8kHz, and the silent frames were removed from the signal. The spectral vectors were computed using a 256-point Short Time Fourier Transform (32ms Hamming window) with a window shift of 64 points (8ms). The frequency resolution was 31.25 Hz (=4kHz/128) per frequency bin. The 256-point STFT magnitude vectors were reduced to 129 points by removing the symmetric half. For FNN/RNN, the input feature consisted of a noisy STFT magnitude vector (size: 129 × 1, duration: 32ms). For CNN, the input feature consisted of 8 consecutive noisy STFT magnitude vectors (size: 129 × 8, duration: 100ms). Both input features were standardized to have zero mean and unit variance.
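The frame arithmetic above can be checked with a short script. This is a sketch using only the numbers stated in the text; the constant and function names are hypothetical, and the frame count assumes no padding at the signal edges. Under these numbers, 8 frames with an 8ms shift span 88ms, which the text rounds to about 100ms.

```python
SAMPLE_RATE_HZ = 8000   # down-sampled rate stated in the text
N_FFT = 256             # STFT length, also the window length in samples
HOP = 64                # window shift in samples
N_INPUT_FRAMES = 8      # consecutive frames fed to the CNN


def ms(samples, rate=SAMPLE_RATE_HZ):
    """Convert a sample count to milliseconds."""
    return 1000.0 * samples / rate


def num_frames(n_samples, n_fft=N_FFT, hop=HOP):
    """Number of full frames in a signal, assuming no padding."""
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop


window_ms = ms(N_FFT)               # duration of one frame
hop_ms = ms(HOP)                    # shift between frames
bin_hz = SAMPLE_RATE_HZ / N_FFT     # frequency resolution per bin
n_bins = N_FFT // 2 + 1             # non-redundant half of the spectrum
# Span of 8 overlapping frames: one full window plus 7 shifts.
span_ms = ms(N_FFT + (N_INPUT_FRAMES - 1) * HOP)

print(window_ms, hop_ms, bin_hz, n_bins, span_ms)
print(num_frames(SAMPLE_RATE_HZ), "frames in one second of audio")
```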
Phase Aware Scaling: To avoid extreme differences (more than 45 degrees) between the noisy and clean phase, the clean spectral magnitude was encoded with a phase aware scaling, similar to prior work.
Besides, spectral phase was not used in the training phase. At reconstruction, the noisy spectral phase was used instead to perform the inverse STFT and recover human speech. However, because the human ear is not susceptible to phase differences smaller than 45 degrees, the distortion was negligible. ‘Phase aware scaling’ is intended to keep the phase mismatch smaller than 45 degrees. For all networks, the output feature consisted of a ‘phase aware’ magnitude vector (size: 129 × 1, duration: 32ms), standardized to have zero mean and unit variance.
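The encoding formula itself is not reproduced in this text. A common phase-sensitive form, assumed here purely for illustration, scales the clean magnitude by the cosine of the phase difference between the clean and noisy spectra. The function name and the use of Python complex numbers for STFT bins are hypothetical.

```python
import cmath
import math


def phase_aware_magnitude(clean_bin, noisy_bin):
    """Scale the clean magnitude by cos(phase difference) for one STFT bin.

    Assumed form: |clean| * cos(angle(clean) - angle(noisy)).
    """
    theta = cmath.phase(clean_bin) - cmath.phase(noisy_bin)
    return abs(clean_bin) * math.cos(theta)


# A clean bin of magnitude 2 whose phase is 60 degrees ahead of the noisy bin.
clean = cmath.rect(2.0, math.radians(90))
noisy = cmath.rect(3.0, math.radians(30))
target = phase_aware_magnitude(clean, noisy)
print(round(target, 6))
```

Under this assumed form, the target equals the clean magnitude when the two phases agree and shrinks as the mismatch grows, so reconstructing with the noisy phase is penalized less where the phases are close.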
Optimization
Evaluation Metric
The Signal to Distortion Ratio (SDR) is inversely associated with the objective function presented in (1). In addition, Short Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) were used to estimate the subjective quality of listening. Both assume that human perception has short term memory, and hence measure the error nonlinearly over the time of interest.
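The definition of SDR is not reproduced in this text. A simplified and commonly used form, assumed here, is the ratio in decibels of the clean signal energy to the energy of the error, so that a smaller squared error gives a larger SDR. The function name and the toy signals are hypothetical.

```python
import math


def sdr_db(clean, estimate):
    """Simplified SDR in dB: 10*log10(||clean||^2 / ||clean - estimate||^2)."""
    signal = sum(c * c for c in clean)
    error = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    if error == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / error)


clean = [1.0, -2.0, 3.0, -4.0]
estimate = [1.0, -2.0, 3.0, -3.0]   # one sample is off by 1
print(sdr_db(clean, estimate))
```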
Experimental Setup
The first experiment compared CNN with FNN and RNN to demonstrate the feasibility of using CNN for speech enhancement. The network configurations (e.g. number of nodes, number of layers) that yielded the best performance for each network are summarized in Table.2. The best FNN and RNN architectures have 4 fully connected (FC) layers, whereas the CNN has 16 convolutional layers.
Test 2: CED vs. R-CED
In the second experiment, R-CED was compared to CED. For a fair comparison, the total number of parameters was fixed to 33,000 (roughly 132KB of memory at 4 bytes per parameter) while the depth of the network was fixed to 10 convolution layers. The filter width of each layer was determined such that i) the symmetric encoder-decoder structure is maintained, ii) the number of parameters is gradually increased and then decreased, and iii) the ‘frequency coverage’ is equal for both networks. Here, ‘frequency coverage’ refers to how many nearby frequency bins at the input are used to reconstruct a single frequency bin at the output. We made sure that both networks use the same number of frequency bins to reconstruct a single frequency bin. The configurations for Test 2 are summarized in the top two rows of Table.1.
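‘Frequency coverage’ as described here corresponds to the receptive field along the frequency axis. Assuming stride-1 convolutions with no pooling (as in R-CED), it can be computed from the filter widths alone. The widths below are hypothetical and are not taken from Table.1.

```python
def frequency_coverage(widths):
    """Input frequency bins that influence one output bin (stride 1, no pooling)."""
    coverage = 1
    for width in widths:
        coverage += width - 1   # each layer widens the view by width - 1 bins
    return coverage


# Hypothetical 10-layer stack with symmetric filter widths.
widths = [13, 11, 9, 7, 7, 7, 7, 9, 11, 13]
coverage = frequency_coverage(widths)
print(coverage, "bins")
```

With pooling layers, as in CED, each filter after a pooling stage covers proportionally more input bins, so this simple sum does not apply and the widths have to be chosen differently to reach the same coverage.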
Test 3: Finding the Best R-CED Performance
In the third experiment, we tested how far the performance can be improved using the R-CED network. R-CED and CR-CED networks with skip connections were compared across various network sizes and depths. The network sizes (number of parameters) considered are 33K (132KB of memory at 4 bytes per parameter) and 100K (400KB of memory). The network depths considered are 10, 16, and 20 convolution layers. The bottom 3 rows of Table.1 summarize the network configurations for Test 3.
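As a rough check on model size, the raw storage needed for the parameters alone can be estimated from the parameter count. The sketch assumes 32-bit (4-byte) floating-point parameters and ignores activations and any framework overhead; both are assumptions, not details from the text, and the function name is hypothetical.

```python
BYTES_PER_PARAM = 4  # assumption: 32-bit floating-point weights


def param_storage_kb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Raw parameter storage in kilobytes (1 KB = 1000 bytes)."""
    return n_params * bytes_per_param / 1000.0


for n_params in (33_000, 100_000):
    print(n_params, "parameters ->", param_storage_kb(n_params), "KB")
```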
Results
Fig.9 illustrates the denoising performance of FNN, RNN and CNN (left), and the corresponding network size (i.e. the number of parameters, right). All networks exhibited similar performance on the measures of subjective quality (STOI, PESQ) and of objective quality (SDR). On the other hand, the model size of CNN was about 68 times smaller than that of FNN and about 12 times smaller than that of RNN. We note that FNN and RNN were optimized to have the smallest network architectures. This experiment validates that CNN requires far fewer parameters per layer due to its weight sharing property, and yet can achieve similar or better performance compared to FNN and RNN. The 33,000 parameters of the CNN occupy roughly 132KB of memory (at 4 bytes per parameter), which can be implemented in an embedded system. Refer to Fig.8 for examples of a noisy spectrogram, a clean spectrogram, and the denoised spectrogram from CNN.
Test 2: CED vs. R-CED
The denoising performance of CED and R-CED is shown in the first 4 bars of Fig.10. The R-CED with skip connections showed the best performance, whereas the CED without skip connections showed the worst performance. Regardless of the presence of skip connections, R-CED yielded better results than CED.
The effect of the skip connections was prominent in CED (5.96 to 7.92). This implies that the decoder alone could not reconstruct the ‘lost’ information compressed at the encoder unless that information was provided by the skip connections. In addition, the resulting speech from CED sounded artificial and mechanical. This suggests that the decoder could not reconstruct what is necessary for the audio to sound like human speech.
On the other hand, the effect of the skip connections was less notable in R-CED (8.07 to 8.19). This is because R-CED expands rather than compresses the input at the encoder, which can be viewed as mapping a spectrum to a higher dimension (e.g. kernel method). By generating redundant representations of important features at the encoder, and removing unwanted features at the decoder, the speech quality was effectively enhanced.
Test 3: Finding the Best R-CED Network
The denoising performance of R-CED for various network sizes and depths is presented in the last five bars of Fig.10. A few interesting observations are that i) the network size was the most dominant factor associated with the network performance, ii) the network depth was secondary, and iii) CR-CED with skip connections yielded the best performance when the other conditions were kept the same (16 convolution layers, 33K parameters).
Conclusion
In this paper, we aimed to find a memory efficient denoising method that can be implemented in an embedded system. Inspired by the past success of FNN and RNN, we hypothesized that CNN can effectively denoise speech with a smaller network size owing to its weight sharing property. We set up an experiment to denoise human speech from babble noise, which is a major discomfort to the users of hearing aids. Through experiments, we demonstrated that CNN can yield similar or better performance with far fewer model parameters compared to FNN and RNN. We also proposed a new fully convolutional network architecture, R-CED, and showed its efficacy in speech enhancement. We observed that the success of R-CED is associated with the increasing dimension of the feature space along the encoder and the decreasing dimension along the decoder. We expect that R-CED can also be applied to other interesting domains. Our future work will include pruning the R-CED to minimize the operation count of convolution.