Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, Yoshua Bengio

Introduction

Recurrent (RNN) and convolutional (CNN) neural networks have improved performance over hidden Markov models (HMMs) combined with Gaussian mixture models (GMMs) in automatic speech recognition (ASR) systems during the last decade. More recently, end-to-end approaches have received growing interest due to the promising results obtained with connectionist temporal classification (CTC) combined with RNNs or CNNs. However, despite this evolution of models and paradigms, the acoustic features remain almost the same. The main motivation is that filters spaced linearly at low frequencies and logarithmically at high frequencies make it possible to capture phonetically important acoustic correlates. Early evidence showed that mel-frequency scaled cepstral coefficients (MFCCs) are effective in capturing the acoustic information required to recognize syllables in continuous speech. Motivated by these analyses, a small number of MFCCs (usually $13$) with their first and second time derivatives have been found suited for statistical and neural ASR systems. In most systems, a time frame of the speech signal is represented by a vector with real-valued elements that express sequences of MFCCs, or filter energies, and their temporal context features. A concern addressed in this paper is that the relations between different views of the features associated with a frequency are not explicitly represented in the feature vectors used so far. Therefore, this paper proposes to:

Introduce a new quaternion representation (Section 2) to encode multiple views of a time-frame frequency in which different views are encoded as values of imaginary parts of a hyper-complex number. Thus, vectors of quaternions are embedded using operations defined by a specific quaternion algebra to preserve a distinction between features of each frequency representation.

Merge a quaternion convolutional neural network (QCNN, Section 3) with the CTC in a unified and easily reusable framework. The full code is available at https://git.io/vx8so.

Compare and evaluate the effectiveness of the proposed QCNN against an equivalent real-valued model on the TIMIT phoneme recognition task (Section 4).

Bundling groups of numbers into a quaternion can bring several advantages. Like capsule networks, quaternion networks create a tighter association between small groups of numbers rather than having one homogeneous representation. In addition, this kind of structure reduces the number of required parameters considerably, because only one weight is necessary between two quaternion units, instead of $4\times 4=16$. The hypothesis tested here is whether these advantages lead to better generalization. The experiments conducted on the TIMIT dataset yielded a phoneme error rate (PER) of $19.64$% for QCNNs, which is significantly lower than the PER obtained with real-valued CNNs ($20.57$%) with the same input features. Moreover, from a practical point of view, the resulting networks have a considerably smaller memory footprint due to a smaller set of parameters.

Quaternion algebra

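A quaternion $Q$ extends a complex number to a four-dimensional space and is written $Q=r1+x\textbf{i}+y\textbf{j}+z\textbf{k}$, where $r$, $x$, $y$ and $z$ are real numbers and $1$, $\textbf{i}$, $\textbf{j}$ and $\textbf{k}$ form the quaternion unit basis. In a quaternion, $r$ is the real part while $x\textbf{i}+y\textbf{j}+z\textbf{k}$ is the imaginary part ($I$) or the vector part. Basic quaternion definitions are:

the products of $\textbf{i}$, $\textbf{j}$ and $\textbf{k}$ satisfy: $\textbf{i}^{2}=\textbf{j}^{2}=\textbf{k}^{2}=\textbf{i}\textbf{j}\textbf{k}=-1$ ,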

conjugate $Q^{*}$ of $Q$ is: $Q^{*}=r1-x\textbf{i}-y\textbf{j}-z\textbf{k}$ ,

unit quaternion $Q^{\triangleleft}=\frac{Q}{\sqrt{r^{2}+x^{2}+y^{2}+z^{2}}}$ ,

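the Hamilton product $\otimes$ between $Q_{1}=r_{1}1+x_{1}\textbf{i}+y_{1}\textbf{j}+z_{1}\textbf{k}$ and $Q_{2}=r_{2}1+x_{2}\textbf{i}+y_{2}\textbf{j}+z_{2}\textbf{k}$ is defined as follows:

$$\begin{aligned}Q_{1}\otimes Q_{2}=\;&(r_{1}r_{2}-x_{1}x_{2}-y_{1}y_{2}-z_{1}z_{2})\\+\;&(r_{1}x_{2}+x_{1}r_{2}+y_{1}z_{2}-z_{1}y_{2})\,\textbf{i}\\+\;&(r_{1}y_{2}-x_{1}z_{2}+y_{1}r_{2}+z_{1}x_{2})\,\textbf{j}\\+\;&(r_{1}z_{2}+x_{1}y_{2}-y_{1}x_{2}+z_{1}r_{2})\,\textbf{k}.\end{aligned}$$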
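As an illustrative sketch (not taken from the released code), the definitions of this section can be written in Python with NumPy by storing a quaternion as an array `(r, x, y, z)`. The function names `hamilton_product`, `conjugate` and `normalize` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def hamilton_product(q1, q2):
    """Hamilton product of two quaternions stored as (r, x, y, z)."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return np.array([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,  # real part
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,  # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,  # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,  # k component
    ])

def conjugate(q):
    """Conjugate Q*: the sign of the imaginary part is flipped."""
    r, x, y, z = q
    return np.array([r, -x, -y, -z])

def normalize(q):
    """Unit quaternion: Q divided by its norm."""
    return q / np.sqrt(np.sum(q ** 2))

# Basis elements i, j, k as quaternions.
i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
k = np.array([0.0, 0.0, 0.0, 1.0])

ij = hamilton_product(i, j)  # equals k
ji = hamilton_product(j, i)  # equals -k: the product is not commutative
```

The last two lines show that the Hamilton product is not commutative, which is why the order of the weight and the input matters in the quaternion layers.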

Quaternion convolutional neural networks

This section defines the internal quaternion representation (Section 3.1), the quaternion convolution (Section 3.2), a proper parameter initialization (Section 3.3), and the connectionist temporal classification (Section 3.4).

The QCNN is a quaternion extension of well-known real-valued and complex-valued deep convolutional neural networks (CNNs). The quaternion algebra is ensured by manipulating matrices of real numbers. Consequently, a traditional $2D$ convolutional layer, with a kernel that contains $N$ feature maps, is split into 4 parts: the first part equal to $r$, the second one to $x\textbf{i}$, the third one to $y\textbf{j}$ and the last one to $z\textbf{k}$ of a quaternion $Q=r1+x\textbf{i}+y\textbf{j}+z\textbf{k}$. Nonetheless, an important condition to perform backpropagation in either real, complex or quaternion neural networks is to have cost and activation functions that are differentiable with respect to each part of the real, complex or quaternion number. Many activation functions for quaternions have been investigated, and a quaternion backpropagation algorithm has been proposed. Consequently, the split activation function is applied to every layer and is defined as follows:

$$\alpha(Q)=\alpha(r)+\alpha(x)\textbf{i}+\alpha(y)\textbf{j}+\alpha(z)\textbf{k},$$

with $\alpha$ corresponding to any standard activation function.
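A minimal sketch of a split activation, assuming quaternions are stored along the last axis of an array as `(r, x, y, z)`. The name `split_activation` and the choice of ReLU as the real-valued function are illustrative only.

```python
import numpy as np

def relu(v):
    """A standard real-valued activation, used here as an example."""
    return np.maximum(v, 0.0)

def split_activation(q, alpha=relu):
    """Apply alpha separately to the r, x, y and z parts of each quaternion.

    q has shape (..., 4); the four parts are never mixed with each other.
    """
    r, x, y, z = (q[..., c] for c in range(4))
    return np.stack([alpha(r), alpha(x), alpha(y), alpha(z)], axis=-1)

q = np.array([[-1.0, 2.0, -3.0, 4.0]])
activated = split_activation(q)
```

Because each part is processed independently, the function is differentiable with respect to each part whenever the real-valued activation is.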

3.2 Quaternion-valued convolution

Following a recent proposition for the convolution of complex numbers and quaternions, this paper presents basic neural network convolution operations using quaternion algebra. The convolution process is defined in the real-valued space by convolving a filter matrix with a vector. In a QCNN, the convolution of a quaternion filter matrix with a quaternion vector is performed. For this computation, the Hamilton product is computed using the real-valued matrix representation of quaternions. Let $W=R+X\textbf{i}+Y\textbf{j}+Z\textbf{k}$ be a quaternion weight filter matrix, and $X_{p}=r+x\textbf{i}+y\textbf{j}+z\textbf{k}$ the quaternion input vector. The quaternion convolution with respect to the Hamilton product $W\otimes X_{p}$ is defined as follows:

$$\begin{aligned}W\otimes X_{p}=\;&(Rr-Xx-Yy-Zz)\\+\;&(Rx+Xr+Yz-Zy)\,\textbf{i}\\+\;&(Ry-Xz+Yr+Zx)\,\textbf{j}\\+\;&(Rz+Xy-Yx+Zr)\,\textbf{k},\end{aligned}$$

and can thus be expressed in a matrix form, whose four rows give the real, $\textbf{i}$, $\textbf{j}$ and $\textbf{k}$ components of the result:

$$W\otimes X_{p}=\begin{bmatrix}R&-X&-Y&-Z\\X&R&-Z&Y\\Y&Z&R&-X\\Z&-Y&X&R\end{bmatrix}\begin{bmatrix}r\\x\\y\\z\end{bmatrix}.$$

An illustration of this operation is depicted in Figure 1.
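The matrix form can be sketched as follows for the simplest case of a single kernel position (a dense layer); a convolution applies the same product at every position. This is an illustration under stated assumptions, not the released implementation: the function name `quaternion_linear` and the input layout `[r; x; y; z]` are hypothetical.

```python
import numpy as np

def quaternion_linear(R, X, Y, Z, inp):
    """Compute W (x) X_p with W = R + Xi + Yj + Zk using one real matrix.

    R, X, Y, Z: real matrices of shape (n_out, n_in).
    inp: real vector of length 4 * n_in laid out as [r; x; y; z].
    Returns a vector of length 4 * n_out laid out the same way.
    """
    w_real = np.block([
        [R, -X, -Y, -Z],   # real part of the output
        [X,  R, -Z,  Y],   # i component
        [Y,  Z,  R, -X],   # j component
        [Z, -Y,  X,  R],   # k component
    ])
    return w_real @ inp

rng = np.random.default_rng(0)
R, X, Y, Z = rng.normal(size=(4, 2, 3))   # n_out = 2, n_in = 3
r, x, y, z = rng.normal(size=(4, 3))
out = quaternion_linear(R, X, Y, Z, np.concatenate([r, x, y, z]))
```

The real matrix has $16$ blocks, but they are built from only the four matrices $R$, $X$, $Y$ and $Z$, which is where the parameter saving comes from.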

3.3 Weight initialization

Weight initialization is crucial to efficiently train neural networks. An appropriate initialization improves training speed and reduces the risk of exploding or vanishing gradients. A quaternion initialization is composed of two steps. First, for each weight to be initialized, a purely imaginary quaternion $q_{imag}$ is generated following a uniform distribution in the interval $ $. The imaginary unit is then normalized to obtain $q_{imag}^{\triangleleft}$ following the quaternion normalization equation. The latter is used alongside a well-known initialization criterion, such as the He criterion, to complete the initialization process of a given quaternion weight named $w$. Moreover, the generated weight has a polar form defined by:

$w_{\textbf{r}}=\phi\cos(\theta)$ ,

$w_{\textbf{i}}=\phi\,q^{\triangleleft}_{imag\textbf{i}}\sin(\theta)$ ,

$w_{\textbf{j}}=\phi\,q^{\triangleleft}_{imag\textbf{j}}\sin(\theta)$ ,

$w_{\textbf{k}}=\phi\,q^{\triangleleft}_{imag\textbf{k}}\sin(\theta)$ .

Here, $\phi$ is a randomly generated variable whose distribution depends on the variance of the quaternion weight and on the selected initialization criterion. The initialization process follows these criteria to derive the variance of the quaternion-valued weight parameters. Therefore, the variance of $W$ has to be investigated:
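As a worked sketch of this step, assume (this assumption is not stated in the text above) that the four components $r$, $x$, $y$ and $z$ of a weight $W$ are independent, zero-mean random variables with the same standard deviation $\sigma$. Since $E(W)=0$, the variance reduces to the expected squared norm:

```latex
\begin{aligned}
Var(W) &= E\left(|W|^{2}\right) - \left|E(W)\right|^{2} = E\left(|W|^{2}\right) \\
       &= E(r^{2}) + E(x^{2}) + E(y^{2}) + E(z^{2}) = 4\sigma^{2}
\end{aligned}
```

Under the additional assumption that the components are Gaussian, the norm $|W|$ follows a chi distribution with four degrees of freedom scaled by $\sigma$, which is consistent with the value $4\sigma^{2}$.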

Therefore, in order to respect the He criterion, the variance would be equal to:
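One possible reading of the whole procedure is sketched below in Python. Several choices are assumptions made for the example and are not specified above: the four components are treated as zero-mean with standard deviation $\sigma$, so that $E(|W|^{2})=4\sigma^{2}$ and the He criterion $Var(W)=2/n_{in}$ gives $\sigma=1/\sqrt{2n_{in}}$; the imaginary quaternion is drawn uniformly in $[0,1]$; $\theta$ is drawn uniformly in $[-\pi,\pi]$; and $\phi$ follows a chi distribution with four degrees of freedom scaled by $\sigma$. The real part is taken as $\phi\cos(\theta)$ because a purely imaginary quaternion has no real component. The function name `quaternion_init` is hypothetical.

```python
import numpy as np

def quaternion_init(n_in, n_out, rng):
    """Sample an (n_out, n_in) quaternion weight matrix in polar form.

    Returns the four real matrices (w_r, w_i, w_j, w_k).
    """
    # Assumed He-style scaling: 4 * sigma^2 = 2 / n_in.
    sigma = 1.0 / np.sqrt(2.0 * n_in)
    shape = (n_out, n_in)

    # Purely imaginary quaternion, then normalized to unit norm.
    # The interval [0, 1] is an assumption of this sketch.
    q_imag = rng.uniform(0.0, 1.0, size=(3,) + shape)
    q_imag /= np.sqrt(np.sum(q_imag ** 2, axis=0))

    # Magnitude and angle of the polar form (both distributions assumed).
    phi = sigma * np.sqrt(rng.chisquare(4, size=shape))
    theta = rng.uniform(-np.pi, np.pi, size=shape)

    w_r = phi * np.cos(theta)
    w_i, w_j, w_k = phi * q_imag * np.sin(theta)
    return w_r, w_i, w_j, w_k

rng = np.random.default_rng(0)
weights = np.stack(quaternion_init(256, 512, rng))
squared_norm = np.sum(weights ** 2, axis=0)   # equals phi ** 2 elementwise
```

Because the normalized imaginary quaternion has unit norm, the norm of each sampled weight equals $\phi$, so the empirical mean of `squared_norm` should be close to $2/n_{in}$ under these assumptions.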

3.4 Connectionist Temporal Classification

In the acoustic modeling part of ASR systems, the task of sequence-to-sequence mapping from an input acoustic signal $X=[x_{1},...,x_{n}]$ to a sequence of symbols $T=[t_{1},...,t_{m}]$ is complex due to:

$X$ and $T$ can be of arbitrary length.

The alignment between $X$ and $T$ is unknown in most cases.

In particular, $T$ is usually shorter than $X$ in terms of phoneme symbols.

To alleviate these problems, connectionist temporal classification (CTC) has been proposed. First, a softmax is applied at each timestep, or frame, providing a probability of emitting each symbol at that timestep. These probabilities define a distribution $P(O|X)$ over symbol sequences $O=[o_{1},...,o_{n}]$ in the latent space. A blank symbol '$-$' is introduced as an extra label to allow the classifier to deal with the unknown alignment. Then, $O$ is transformed to the final output sequence with a many-to-one function $g(O)$, which merges consecutive repeated symbols and then removes the blank symbols.
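The many-to-one mapping $g$ of standard CTC can be sketched in a few lines of Python. The name `ctc_collapse` and the use of the character `-` as the blank are choices made for this example.

```python
from itertools import groupby

BLANK = "-"

def ctc_collapse(path, blank=BLANK):
    """Map a latent sequence O to an output sequence.

    Step 1 merges consecutive repeated symbols, step 2 drops blanks.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]

# Several latent sequences collapse to the same output.
same_a = ctc_collapse(list("aa-b-"))
same_b = ctc_collapse(list("-a-bb"))
# A blank between two identical symbols keeps them distinct.
doubled = ctc_collapse(list("a-a"))
```

Here `same_a` and `same_b` are both `['a', 'b']`, while `doubled` is `['a', 'a']`, which shows why the blank is needed to emit the same symbol twice in a row.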

Consequently, the probability of the output sequence is a summation over the probabilities of all possible alignments between $X$ and $T$ after applying the function $g(O)$. Accordingly, the parameters of the models are learned based on the cross-entropy loss function:
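To make the summation over alignments concrete, the sketch below computes the standard CTC likelihood $P(T|X)=\sum_{O:\,g(O)=T}P(O|X)$ by brute-force enumeration and takes its negative logarithm as the loss for one example. This is an illustration for tiny inputs only, written under the usual CTC assumption that frames are conditionally independent; practical implementations use a forward-backward dynamic program instead. The function names and the convention that index `0` is the blank are assumptions of the example.

```python
from itertools import groupby, product

import numpy as np

def collapse(path, blank):
    """Merge repeated symbols, then remove blanks (the function g)."""
    return tuple(s for s, _ in groupby(path) if s != blank)

def ctc_likelihood(probs, target, blank=0):
    """Sum the probability of every latent path that collapses to target.

    probs: array (n_frames, n_symbols) of per-frame softmax outputs.
    """
    n_frames, n_symbols = probs.shape
    total = 0.0
    for path in product(range(n_symbols), repeat=n_frames):
        if collapse(path, blank) == tuple(target):
            total += np.prod(probs[np.arange(n_frames), list(path)])
    return total

def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of the target sequence."""
    return -np.log(ctc_likelihood(probs, target, blank))

# Two frames, one real symbol (index 1) plus the blank (index 0).
uniform = np.full((2, 2), 0.5)
p_single = ctc_likelihood(uniform, (1,))   # paths (1,1), (1,0), (0,1)
```

With uniform probabilities, three of the four possible paths collapse to the single symbol, so `p_single` is $0.75$.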

During inference, best-path decoding is performed: the latent sequence with the highest probability is obtained by taking the argmax of the softmax output at each timestep. The final sequence is then obtained by applying the function $g(.)$ to this latent sequence.
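Best-path decoding can be sketched as follows, assuming the softmax outputs are stored in an array of shape `(n_frames, n_symbols)` and that index `0` is the blank. The name `best_path_decode` is hypothetical.

```python
from itertools import groupby

import numpy as np

def best_path_decode(probs, blank=0):
    """Greedy CTC decoding: per-frame argmax, then merge repeats and drop blanks."""
    path = np.argmax(probs, axis=1)
    return [int(s) for s, _ in groupby(path) if s != blank]

# Toy softmax outputs whose per-frame argmax is 1, 1, 0, 2, 2, 0, 2.
best = [1, 1, 0, 2, 2, 0, 2]
probs = np.full((7, 3), 0.1)
probs[np.arange(7), best] = 0.8
decoded = best_path_decode(probs)
```

For this toy input `decoded` is `[1, 2, 2]`: the repeated frames are merged and the blank between the last two frames keeps the two occurrences of symbol `2` separate.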

Experiments

The performance and efficiency of the proposed QCNNs are evaluated on a phoneme recognition task. This section provides details on the dataset and the quaternion feature representation (Section 4.1), the model configurations (Section 4.2), and finally a discussion of the observed results (Section 4.3).

The TIMIT dataset is composed of a standard 462-speaker training dataset, a 50-speaker development dataset and a core test dataset of $192$ sentences. During the experiments, the SA records of the training set are removed and the development set is used for early stopping. The raw audio is transformed into $40$-dimensional log mel-filter-bank coefficients with deltas, delta-deltas, and energy terms, resulting in a one-dimensional vector of length $123$. An acoustic quaternion $Q(f,t)$ associated with a frequency $f$ and a time frame $t$ is defined as follows:

$$Q(f,t)=0+e(f,t)\textbf{i}+\frac{\partial e(f,t)}{\partial t}\textbf{j}+\frac{\partial^{2}e(f,t)}{\partial t^{2}}\textbf{k}.$$

It represents multiple views of a frequency $f$ at time frame $t$ , consisting of the energy $e(f,t)$ in the filter band corresponding to $f$ , its first time derivative describing a slope view, and its second time derivative describing a concavity view. Finally, a unique quaternion is composed with the three corresponding energy terms. Thus, the quaternion input vector length is $41$ ( $\frac{123}{3}$ ).
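The construction of the quaternion input can be sketched as below. The layout of the $123$-dimensional vector is an assumption of this example (static terms first, then deltas, then delta-deltas, each block holding the $40$ filter-bank coefficients plus the energy term); the actual ordering depends on the feature extraction tool. The real part is left at zero and the name `to_quaternion_features` is hypothetical.

```python
import numpy as np

def to_quaternion_features(frames):
    """Turn real-valued frames into quaternion-valued frames.

    frames: array (n_frames, 123), assumed laid out as
            [static (41) | delta (41) | delta-delta (41)].
    Returns an array (n_frames, 41, 4) holding (r, x, y, z) per band,
    with r = 0, x = energy, y = first derivative, z = second derivative.
    """
    n_frames, dim = frames.shape
    if dim % 3 != 0:
        raise ValueError("feature dimension must be divisible by 3")
    n_bands = dim // 3
    static = frames[:, :n_bands]
    delta = frames[:, n_bands:2 * n_bands]
    delta2 = frames[:, 2 * n_bands:]
    real = np.zeros_like(static)
    return np.stack([real, static, delta, delta2], axis=-1)

frames = np.arange(2 * 123, dtype=float).reshape(2, 123)
quaternions = to_quaternion_features(frames)
```

Each of the $41$ entries of a frame is then one quaternion grouping the three views of the same band.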

4.2 Model architectures

The architectures of both CNN and QCNN models are inspired by prior work. A first $2$D convolutional layer is followed by a max-pooling layer along the frequency axis. Then, $n$ $2$D convolutional layers are included, together with $3$ dense layers of sizes $1024$ and $256$ respectively for real- and quaternion-valued models (with $n\in$ ). Indeed, the output of a dense quaternion-valued layer has $256\times 4=1024$ nodes and is $4$ times larger than the number of units. The filter size is rectangular $(3,5)$, and padding is applied to keep the sequence and signal sizes unaltered. The number of feature maps varies from $32$ to $256$ for the real-valued models and from $8$ to $64$ for quaternion-valued models. Indeed, the number of output feature maps is $4$ times larger in the QCNN due to the quaternion convolution, meaning $32$ quaternion-valued feature maps correspond to $128$ real-valued ones. The PReLU activation function is employed for both models. A dropout of $0.3$ and an $L_{2}$ regularization of $10^{-5}$ are used across all the layers, except the input and output ones. CNNs and QCNNs are trained with the Adam optimizer and default hyperparameters for $100$ epochs. Then, a fine-tuning process of $50$ epochs is performed with standard stochastic gradient descent (SGD) and a learning rate of $10^{-5}$. Finally, the standard CTC loss function is applied. Experiments are performed on Tesla P100 and GeForce Titan X GPUs.

4.3 Results and discussion

Results on the phoneme recognition task of the TIMIT dataset are reported in Table 1. It is worth noticing the large difference in the number of learning parameters between real- and quaternion-valued CNNs, which is easily explained by the quaternion algebra. In the case of a dense layer with $1,024$ input values and $1,024$ hidden units, a real-valued model will have $1,024^{2}\approx 1$ M parameters, while to maintain equal input and output nodes ($1,024$) the quaternion equivalent has $256$ quaternion inputs and $256$ quaternion-valued hidden units. Therefore, the number of parameters for the quaternion model is $256^{2}\times 4\approx 0.26$ M. Such a complexity reduction turns out to produce better results and may have other advantages, such as a smaller memory footprint when saving NN models. Moreover, the reduction of the number of parameters does not result in poor performance in the QCNN. Indeed, the best PER reported is $19.64$% from a QCNN with $256$ feature maps and $10$ layers, compared to a PER of $20.57$% for a real-valued CNN with $64$ feature maps and $10$ layers. It is worth underlining that the accuracy of both models increases with the size and the depth of the neural network. However, larger real-valued feature maps lead to overfitting. In fact, as shown in Table 1, the best PER for a real-valued model is reached with $64$ feature maps ($20.57$%) and degrades at $128$ ($20.62$%) and $256$ ($21.23$%). The QCNN does not suffer from this weakness, due to the smaller density of the neural network, and achieves a constant PER improvement as the number of feature maps increases. Furthermore, QCNNs always performed better than CNNs independently of the model topologies.
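The parameter counts quoted above can be checked with a short calculation. The helper names are hypothetical and biases are ignored, as in the text.

```python
def real_dense_params(n_in, n_out):
    """Weights of a real-valued dense layer (biases ignored)."""
    return n_in * n_out

def quaternion_dense_params(n_in, n_out):
    """Weights of a quaternion dense layer with the same number of real nodes.

    n_in and n_out are counted in real values, so the layer connects
    n_in / 4 quaternion inputs to n_out / 4 quaternion units, and each
    quaternion weight stores 4 real numbers.
    """
    return (n_in // 4) * (n_out // 4) * 4

real_count = real_dense_params(1024, 1024)               # 1,048,576
quaternion_count = quaternion_dense_params(1024, 1024)   # 262,144
ratio = real_count / quaternion_count                    # 4.0
```

For the same number of input and output nodes, the quaternion layer therefore stores four times fewer weights.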

With far fewer learning parameters for a given architecture, the QCNN always performs better than the real-valued one on the reported task. In terms of PER, an average relative gain of $3.25$% (w.r.t. the CNN results) is obtained on the test set. It is also worth recalling that the best PER of $19.64$% is obtained with just a QCNN, without HMMs, RNNs, attention mechanisms, batch normalization, a phoneme language model, acoustic data normalization or adaptation. Further improvements can be obtained with exactly the same QCNN by just introducing a new acoustic feature in the real part of the quaternions.

Related work

Early attempts to perform phoneme and phonetic feature recognition relied on multilayer perceptrons (MLPs). A PER of $26.1$% has been reported using RNNs. More recently, a Mean-Covariance Restricted Boltzmann Machine (RBM) was used for feature extraction when recognizing phonemes in the TIMIT corpus. Along this line of research, an approach called Connectionist Temporal Classification (CTC) has been developed and can be used without an explicit input-output alignment. Bidirectional RNNs (BRNNs) have been used to process input data in both directions with two separate hidden layers, which are then composed in an output layer. With standard mel-frequency energies and their first and second time derivatives, a PER of $17.7$% was obtained. Other recent results with real-valued vectors of similar features have also been reported. Other types of quaternion-valued neural networks (QNNs) were introduced for encoding RGB color relations in image pixels, and for classifying human/human conversation topics. A quaternion deep convolutional and residual neural network has shown impressive results on the CIFAR image classification task. However, a specific quaternion is used there for each RGB color value, rather than integrating multiple views of a pixel, as suggested in this paper for an ASR task.

Conclusions

Summary. This paper proposes to integrate multiple acoustic feature views with quaternion hyper-complex numbers, and to process these features with a convolutional neural network of quaternions. The phoneme recognition experiments have shown that: 1) given an equivalent architecture, QCNNs always outperform CNNs with significantly fewer parameters; 2) QCNNs obtain better results than CNNs with a similar number of learning parameters; 3) the best result obtained with QCNNs is better than the one observed with the real-valued counterpart. This supports the initial intuition that the capability of the Hamilton product to learn internal latent relations helps quaternion-valued neural networks to achieve better results.

Limitations and Future Work. So far, traditional acoustic features, such as mel filter bank energies and their first and second derivatives, have shown that good results can be obtained with a relatively small set of input features for a speech time frame. Nevertheless, speech science has shown that other multi-view context-dependent acoustic relations characterize signals of phonemes in context. Future work will attempt to characterize the multi-view features that contribute most to reducing ambiguities in representing phoneme events. Furthermore, quaternion-valued RNNs will also be investigated to see if they can contribute to the improvement of recently achieved top-of-the-line results with real-valued RNNs.

Acknowledgements

The experiments were conducted using Keras. The authors would like to acknowledge the computing support of Compute Canada and the funding support of Orkis, NSERC, Samsung, IBM and CHIST-ERA/FRQ. The authors would like to thank Kyle Kastner and Mirco Ravanelli for their helpful comments.