Quaternion Recurrent Neural Networks

Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chiheb Trabelsi, Renato De Mori, Yoshua Bengio

Introduction

In the last few years, deep neural networks (DNN) have achieved wide success in different domains due to their capability to learn highly complex input-to-output mappings. Among the different DNN-based models, the recurrent neural network (RNN) is well adapted to process sequential data. Indeed, RNNs build a vector of activations at each timestep to code latent relations between input vectors. Deep RNNs have been recently used to obtain hidden representations of speech unit sequences (Ravanelli et al., 2018a) or text word sequences (Conneau et al., 2018), and to achieve state-of-the-art performance in many speech recognition tasks (Graves et al., 2013a; b; Amodei et al., 2016; Povey et al., 2016; Chiu et al., 2018). However, many recent tasks based on multi-dimensional input features, such as pixels of an image, acoustic features, or orientations of 3D models, require representing both external dependencies between different entities and internal relations between the features that compose each entity. Moreover, RNN-based algorithms commonly require a huge number of parameters to represent sequential data in the hidden space. Quaternions are hypercomplex numbers that contain a real and three separate imaginary components, perfectly fitting 3- and 4-dimensional feature vectors, such as for image processing and robot kinematics (Sangwine, 1996; Pei & Cheng, 1999; Aspragathos & Dimitros, 1998). The idea of bundling groups of numbers into separate entities is also exploited by the recent manifold and capsule networks (Chakraborty et al., 2018; Sabour et al., 2017). Contrary to traditional homogeneous representations, capsule and quaternion networks bundle sets of features together.
Thereby, quaternion numbers allow neural network based models to code latent inter-dependencies between groups of input features during the learning process with fewer parameters than RNNs, by taking advantage of the Hamilton product as the equivalent of the ordinary product, but between quaternions. Early applications of quaternion-valued backpropagation algorithms (Arena et al., 1994; 1997) have efficiently solved quaternion function approximation tasks. More recently, neural networks of complex and hypercomplex numbers have received increasing attention (Hirose & Yoshida, 2012; Tygert et al., 2016; Danihelka et al., 2016; Wisdom et al., 2016), and some efforts have shown promising results in different applications. In particular, a deep quaternion network (Parcollet et al., 2016; 2017a; 2017b), a deep quaternion convolutional network (Gaudet & Maida, 2018; Parcollet et al., 2018), or a deep complex convolutional network (Trabelsi et al., 2017) have been employed for challenging tasks such as image and language processing. However, these applications do not include recurrent neural networks with operations defined by the quaternion algebra. This paper proposes to integrate local spectral features in a novel model called quaternion recurrent neural network (QRNN; code available at https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks), and its gated extension called quaternion long-short term memory neural network (QLSTM). The model is proposed along with a well-adapted parameter initialization and turns out to learn both inter- and intra-dependencies between multidimensional input features and the basic elements of a sequence with drastically fewer parameters (Section 3), making the approach more suitable for low-resource applications.
The effectiveness of the proposed QRNN and QLSTM is evaluated on the realistic TIMIT phoneme recognition task (Section 4.2), which shows that both QRNN and QLSTM obtain better performance than RNNs and LSTMs, with a best observed phoneme error rate (PER) of 18.5% and 15.1% for QRNN and QLSTM, compared to 19.0% and 15.3% for RNN and LSTM. Moreover, these results are obtained alongside a reduction by a factor of 3.3 in the number of free parameters. Similar results are observed with the larger Wall Street Journal (WSJ) dataset, whose detailed performance is reported in Appendix 6.1.1.

Motivations

A major challenge of current machine learning models is to represent well, in the latent space, the astonishing amount of data available for recent tasks. For this purpose, a good model has to efficiently encode local relations within the input features, such as between the Red, Green, and Blue (R,G,B) channels of a single image pixel, as well as structural relations, such as those describing edges or shapes composed by groups of pixels. Moreover, in order to learn an adequate representation with the available set of training data and to avoid overfitting, it is convenient to conceive a neural architecture with the smallest number of parameters to be estimated. In the following, we detail the motivations to employ a quaternion-valued RNN instead of a real-valued one to code inter- and intra-feature dependencies with fewer parameters.

As a first step, a better representation of multidimensional data has to be explored to naturally capture internal relations within the input features. For example, an efficient way to represent the information composing an image is to consider each pixel as a whole entity of three strongly related elements, instead of a group of uni-dimensional elements that could be related to each other, as in traditional real-valued neural networks. Indeed, with a real-valued RNN, the latent relations between the RGB components of a given pixel are hard to code in the latent space, since the weights have to find out these relations among all the pixels composing the image. This problem is effectively solved by replacing real numbers with quaternion numbers. Indeed, quaternions are four-dimensional and allow one to build and process entities made of up to four related features. The quaternion algebra, and more precisely the Hamilton product, allows quaternion neural networks (QNNs) to capture these internal latent relations within the features encoded in a quaternion. It has been shown that QNNs are able to restore the spatial relations within 3D coordinates (Matsui et al., 2004), and within color pixels (Isokawa et al., 2003), while real-valued NNs failed. This is easily explained by the fact that the quaternion-weight components are shared across multiple quaternion-input parts during the Hamilton product, creating relations within the elements. Indeed, Figure 1 shows that, in a real-valued network, the multiple weights required to code latent relations within a feature are considered at the same level as those for learning global relations between different features, while the quaternion weight $w$ codes these internal relations within a unique quaternion $Q_{out}$ during the Hamilton product (right).

Then, while bigger neural networks allow better performance, quaternion neural networks make it possible to deal with the same signal dimension with four times fewer neural parameters. Indeed, a 4-number quaternion weight linking two 4-number quaternion units only has 4 degrees of freedom, whereas a standard neural net parametrization has $4\times 4=16$, i.e., a 4-fold saving in memory. Therefore, the natural multidimensional representation of quaternions, alongside their ability to drastically reduce the number of parameters, indicates that hyper-complex numbers are a better fit than real numbers to create more efficient models in multidimensional spaces. Based on the success of previous deep quaternion convolutional neural networks and smaller quaternion feed-forward architectures (Kusamichi et al., 2004; Isokawa et al., 2009; Parcollet et al., 2017a), this work proposes to adapt the representation of hyper-complex numbers to the capability of recurrent neural networks in a natural and efficient framework for multidimensional sequential tasks such as speech recognition.

Modern automatic speech recognition systems usually employ input sequences composed of multidimensional acoustic features, such as log Mel features, that are often enriched with their first, second and third time derivatives (Davis & Mermelstein, 1990; Furui, 1986), to integrate contextual information. In standard RNNs, static features are simply concatenated with their derivatives to form a large input vector, without effectively considering that signal derivatives represent different views of the same input. Nonetheless, it is crucial to consider that time derivatives of the spectral energy in a given frequency band at a specific time frame represent a special state of a time-frame, and are linearly correlated (Tokuda et al., 2003). Based on the above motivations and the results observed in previous work on quaternion neural networks, we hypothesize that quaternion RNNs naturally provide a more suitable representation of the input sequence, since these multiple views can be directly embedded in the multidimensional space of the quaternion, leading to better generalization.

Quaternion recurrent neural networks

This Section describes the quaternion algebra (Section 3.1), the internal quaternion representation (Section 3.2), the backpropagation through time (BPTT) for quaternions (Section 3.3.2), and proposes an adapted weight initialization to quaternion-valued neurons (Section 3.4).

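A quaternion $Q$ is an extension of a complex number to a four-dimensional space, defined as:

$$Q = r1 + x\textbf{i} + y\textbf{j} + z\textbf{k},$$

where $r$, $x$, $y$, and $z$ are real numbers, and $1$, i, j, and k are the quaternion unit basis. In a quaternion, $r$ is the real part, while $x\textbf{i}+y\textbf{j}+z\textbf{k}$ with $\textbf{i}^{2}=\textbf{j}^{2}=\textbf{k}^{2}=\textbf{i}\textbf{j}\textbf{k}=-1$ is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. The information embedded in the quaternion $Q$ can be summarized into the following matrix of real numbers:

$$Q_{mat} = \begin{bmatrix} r & -x & -y & -z \\ x & r & -z & y \\ y & z & r & -x \\ z & -y & x & r \end{bmatrix}.$$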

The conjugate $Q^{*}$ of $Q$ is defined as:

$$Q^{*} = r1 - x\textbf{i} - y\textbf{j} - z\textbf{k}.$$

Then, a normalized or unit quaternion $Q^{\triangleleft}$ is expressed as:

$$Q^{\triangleleft} = \frac{Q}{\sqrt{r^{2}+x^{2}+y^{2}+z^{2}}}.$$

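Finally, the Hamilton product $\otimes$ between two quaternions $Q_{1}=r_{1}1+x_{1}\textbf{i}+y_{1}\textbf{j}+z_{1}\textbf{k}$ and $Q_{2}=r_{2}1+x_{2}\textbf{i}+y_{2}\textbf{j}+z_{2}\textbf{k}$ is computed as follows:

$$\begin{aligned} Q_{1}\otimes Q_{2} ={}& (r_{1}r_{2}-x_{1}x_{2}-y_{1}y_{2}-z_{1}z_{2}) \\ &+ (r_{1}x_{2}+x_{1}r_{2}+y_{1}z_{2}-z_{1}y_{2})\,\textbf{i} \\ &+ (r_{1}y_{2}-x_{1}z_{2}+y_{1}r_{2}+z_{1}x_{2})\,\textbf{j} \\ &+ (r_{1}z_{2}+x_{1}y_{2}-y_{1}x_{2}+z_{1}r_{2})\,\textbf{k}. \end{aligned}$$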
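The operations of the quaternion algebra described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names and the convention of storing a quaternion as an array `(r, x, y, z)` are assumptions made for the example.

```python
import numpy as np

def hamilton_product(q1, q2):
    """Hamilton product of two quaternions stored as (r, x, y, z)."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return np.array([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,  # real part
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,  # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,  # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,  # k component
    ])

def conjugate(q):
    """Conjugate: the three imaginary components change sign."""
    r, x, y, z = q
    return np.array([r, -x, -y, -z])

def normalize(q):
    """Unit quaternion: divide by the Euclidean norm of the four components."""
    q = np.asarray(q, dtype=float)
    return q / np.linalg.norm(q)

i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
k = np.array([0.0, 0.0, 0.0, 1.0])

print(hamilton_product(i, j))  # equals k
print(hamilton_product(j, i))  # equals -k: the product is not commutative
```

The two printed products differ in sign, which illustrates that the Hamilton product is not commutative, so the order of the weight and the input matters in a quaternion layer.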

3.2 Quaternion representation

The QRNN is an extension of the real-valued (Medsker & Jain, 2001) and complex-valued (Hu & Wang, 2012; Song & Yam, 1998) recurrent neural networks to hypercomplex numbers. In a quaternion dense layer, all parameters are quaternions, including inputs, outputs, weights, and biases. The quaternion algebra is ensured by manipulating matrices of real numbers (Gaudet & Maida, 2018). Consequently, for each input vector of size $N$ and output vector of size $M$, dimensions are split into four parts: the first one equals $r$, the second $x\textbf{i}$, the third $y\textbf{j}$, and the last one $z\textbf{k}$, to compose a quaternion $Q=r1+x\textbf{i}+y\textbf{j}+z\textbf{k}$. The inference process of a fully-connected layer is defined in the real-valued space by the dot product between an input vector and a real-valued $M\times N$ weight matrix. In a QRNN, this operation is replaced with the Hamilton product (Section 3.1) with quaternion-valued matrices (i.e. each entry in the weight matrix is a quaternion). The computational complexity of quaternion-valued models is discussed in Appendix 6.1.2.
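As a sketch of how the quaternion algebra can be carried out with matrices of real numbers, the example below builds one real block matrix from the four real components of a quaternion weight matrix and applies it to an input laid out as `[r | i | j | k]`. The helper name `quaternion_matrix` and the block layout of the input vector are assumptions made for illustration.

```python
import numpy as np

def quaternion_matrix(Wr, Wi, Wj, Wk):
    """Real block matrix that applies W (Hamilton product) to a [r | i | j | k] vector."""
    return np.block([
        [Wr, -Wi, -Wj, -Wk],
        [Wi,  Wr, -Wk,  Wj],
        [Wj,  Wk,  Wr, -Wi],
        [Wk, -Wj,  Wi,  Wr],
    ])

rng = np.random.default_rng(0)
n_in, n_out = 2, 3                       # sizes counted in quaternions
Wr, Wi, Wj, Wk = rng.standard_normal((4, n_out, n_in))
x = rng.standard_normal(4 * n_in)        # 2 quaternions stored as 8 real values

W_real = quaternion_matrix(Wr, Wi, Wj, Wk)
y = W_real @ x                           # 3 output quaternions as 12 real values
print(W_real.shape, y.shape)             # (12, 8) (12,)
```

The block matrix has 96 entries but only 24 free parameters, since the same four components are reused in every block row. This reuse is the weight sharing that the Hamilton product introduces.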

3.3 Learning algorithm

The QRNN differs from the real-valued RNN in each learning sub-process. Therefore, let $x_{t}$ be the input vector at timestep $t$, $h_{t}$ the hidden state, and $W_{hx}$, $W_{hy}$ and $W_{hh}$ the input, output and hidden state weight matrices respectively. The vector $b_{h}$ is the bias of the hidden state and $p_{t}$, $y_{t}$ are the output and the expected target vectors. More details of the learning process and the parametrization are available in Appendix 6.2.

Based on the forward propagation of the real-valued RNN (Medsker & Jain, 2001), the QRNN forward equations are extended as follows:

$$h_{t} = \alpha(W_{hh}\otimes h_{t-1} + W_{hx}\otimes x_{t} + b_{h}),$$

where $\alpha$ is a quaternion split activation function (Xu et al., 2017; Tripathi, 2016) defined as:

$$\alpha(Q) = f(r) + f(x)\textbf{i} + f(y)\textbf{j} + f(z)\textbf{k},$$

with $f$ corresponding to any standard activation function. The split approach is preferred in this work due to better prior investigations, better stability (i.e. pure quaternion activation functions contain singularities), and simpler computations. The output vector $p_{t}$ is computed as:

$$p_{t} = \beta(W_{hy}\otimes h_{t}),$$

where $\beta$ is any split activation function. Finally, the objective function is a classical loss applied component-wise (e.g., mean squared error, negative log-likelihood).
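The forward phase described above can be sketched as follows, assuming the recurrence $h_{t}=\alpha(W_{hh}\otimes h_{t-1}+W_{hx}\otimes x_{t}+b_{h})$ and the output $p_{t}=\beta(W_{hy}\otimes h_{t})$. The helper names, the `[r | i | j | k]` vector layout, the Gaussian weights and the choice of tanh for both $\alpha$ and $\beta$ are assumptions made for this example only.

```python
import numpy as np

def quaternion_matrix(Wr, Wi, Wj, Wk):
    """Real block matrix that applies W (Hamilton product) to a [r | i | j | k] vector."""
    return np.block([
        [Wr, -Wi, -Wj, -Wk],
        [Wi,  Wr, -Wk,  Wj],
        [Wj,  Wk,  Wr, -Wi],
        [Wk, -Wj,  Wi,  Wr],
    ])

def random_quaternion_weights(rng, n_out, n_in, scale=0.1):
    """Four real (n_out, n_in) components; plain Gaussian values, for illustration only."""
    return tuple(scale * rng.standard_normal((n_out, n_in)) for _ in range(4))

def qrnn_forward(xs, W_hx, W_hh, W_hy, b_h, alpha=np.tanh, beta=np.tanh):
    """Run the QRNN recurrence over a sequence and return the outputs p_t."""
    M_hx, M_hh, M_hy = (quaternion_matrix(*W) for W in (W_hx, W_hh, W_hy))
    h = np.zeros(M_hh.shape[0])
    outputs = []
    for x_t in xs:
        # Split activation: the real function acts on each component separately,
        # which is an element-wise call on the [r | i | j | k] layout.
        h = alpha(M_hh @ h + M_hx @ x_t + b_h)
        outputs.append(beta(M_hy @ h))
    return np.stack(outputs)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 40, 8, 4              # sizes counted in quaternions
W_hx = random_quaternion_weights(rng, n_hid, n_in)
W_hh = random_quaternion_weights(rng, n_hid, n_hid)
W_hy = random_quaternion_weights(rng, n_out, n_hid)
b_h = np.zeros(4 * n_hid)
xs = rng.standard_normal((5, 4 * n_in))    # 5 timesteps of 160 real values
ps = qrnn_forward(xs, W_hx, W_hh, W_hy, b_h)
print(ps.shape)                            # (5, 16)
```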

3.3.2 Quaternion Backpropagation Through Time

The backpropagation through time (BPTT) for quaternion numbers (QBPTT) is an extension of the standard quaternion backpropagation (Nitta, 1995), and its full derivation is available in Appendix 6.3. The gradient with respect to the loss $E_{t}$ is expressed for each weight matrix as $\Delta^{t}_{hy}=\frac{\partial E_{t}}{\partial W_{hy}}$, $\Delta^{t}_{hh}=\frac{\partial E_{t}}{\partial W_{hh}}$, $\Delta^{t}_{hx}=\frac{\partial E_{t}}{\partial W_{hx}}$, for the bias vector as $\Delta^{t}_{b}=\frac{\partial E_{t}}{\partial B_{h}}$, and is generalized to $\Delta^{t}=\frac{\partial E_{t}}{\partial W}$ with:

$$\frac{\partial E_{t}}{\partial W} = \frac{\partial E_{t}}{\partial W^{r}} + \textbf{i}\frac{\partial E_{t}}{\partial W^{i}} + \textbf{j}\frac{\partial E_{t}}{\partial W^{j}} + \textbf{k}\frac{\partial E_{t}}{\partial W^{k}}.$$

Each term of the above relation is then computed by applying the chain rule. Indeed, and contrary to real-valued backpropagation, QBPTT must define the dynamic of the loss w.r.t. each component of the quaternion neural parameters. As a use-case for the equations, the mean squared error at a timestep $t$, named $E_{t}$, is used as the loss function. Moreover, let $\lambda$ be a fixed learning rate. First, the weight matrix $W_{hy}$ only appears in the equations of $p_{t}$. It is therefore straightforward to update each weight of $W_{hy}$ at timestep $t$ following:

where $h^{*}_{t}$ is the conjugate of $h_{t}$. Then, the weight matrices $W_{hh}$, $W_{hx}$ and biases $b_{h}$ are arguments of $h_{t}$ with $h_{t-1}$ involved, and the update equations are derived as:

with $h_{n}^{preact}$ and $p_{n}^{preact}$ the pre-activation values of $h_{n}$ and $p_{n}$ respectively.

3.4 Parameter initialization

A well-designed parameter initialization scheme strongly impacts the efficiency of a DNN. An appropriate initialization, in fact, improves DNN convergence, reduces the risk of exploding or vanishing gradient, and often leads to a substantial performance improvement (Glorot & Bengio, 2010). It has been shown that the backpropagation through time algorithm of RNNs is degraded by an inappropriate parameter initialization (Sutskever et al., 2013). Moreover, a hyper-complex parameter cannot be simply initialized randomly and component-wise, due to the interactions between components. Therefore, this Section proposes a procedure reported in Algorithm 1 to initialize a matrix $W$ of quaternion-valued weights. The proposed initialization equations are derived from the polar form of a weight $w$ of $W$:

$$w = |w|e^{q_{imag}^{\triangleleft}\theta} = |w|\left(\cos(\theta) + q_{imag}^{\triangleleft}\sin(\theta)\right).$$

The angle $\theta$ is randomly generated in the interval $[-\pi,\pi]$. The quaternion $q_{imag}^{\triangleleft}$ is defined as purely imaginary and normalized, and is expressed as $q_{imag}^{\triangleleft}=0+x\textbf{i}+y\textbf{j}+z\textbf{k}$. The imaginary components $x\textbf{i}$, $y\textbf{j}$, and $z\textbf{k}$ are sampled from a uniform distribution to obtain $q_{imag}$, which is then normalized (following Eq. 4) to obtain $q_{imag}^{\triangleleft}$. The parameter $\varphi$ is a random number generated with respect to well-known initialization criteria (such as Glorot or He algorithms) (Glorot & Bengio, 2010; He et al., 2015). However, the equations derived in (Glorot & Bengio, 2010; He et al., 2015) are defined for real-valued weight matrices. Therefore, the variance of $W$ has to be investigated in the quaternion space to obtain $\varphi$ (the full demonstration is provided in Appendix 6.2). The variance of $W$ is:

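The Glorot (Glorot & Bengio, 2010) and He (He et al., 2015) criteria are extended to quaternions as:

$$\sigma = \frac{1}{\sqrt{2(n_{in}+n_{out})}}, \quad \text{and} \quad \sigma = \frac{1}{\sqrt{2n_{in}}},$$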

with $n_{in}$ and $n_{out}$ the number of neurons of the input and output layers respectively. Finally, $\varphi$ can be sampled from $[-\sigma,\sigma]$ to complete the weight initialization of Eq. 15.
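The initialization procedure described in this Section can be sketched as below. It is a hedged illustration rather than a transcription of Algorithm 1: the function name is hypothetical, the quaternion Glorot and He criteria are taken as $\sigma=1/\sqrt{2(n_{in}+n_{out})}$ and $\sigma=1/\sqrt{2n_{in}}$, the sampling interval $[0,1]$ for the imaginary components is an assumption, and each weight is built from its polar form with $|w|=\varphi$.

```python
import numpy as np

def quaternion_init(n_in, n_out, criterion="glorot", rng=None):
    """Return the four real components (w_r, w_i, w_j, w_k) of a quaternion weight matrix."""
    rng = np.random.default_rng() if rng is None else rng
    if criterion == "glorot":
        sigma = 1.0 / np.sqrt(2.0 * (n_in + n_out))
    elif criterion == "he":
        sigma = 1.0 / np.sqrt(2.0 * n_in)
    else:
        raise ValueError("criterion must be 'glorot' or 'he'")

    shape = (n_out, n_in)
    theta = rng.uniform(-np.pi, np.pi, size=shape)   # angle of the polar form
    phi = rng.uniform(-sigma, sigma, size=shape)     # magnitude term

    # Purely imaginary quaternion; the [0, 1] sampling interval is an assumption.
    imag = rng.uniform(0.0, 1.0, size=(3,) + shape)
    imag /= np.linalg.norm(imag, axis=0)             # normalize to unit length

    w_r = phi * np.cos(theta)
    w_i, w_j, w_k = phi * imag * np.sin(theta)
    return w_r, w_i, w_j, w_k

w = quaternion_init(64, 32, rng=np.random.default_rng(0))
magnitude = np.sqrt(sum(c ** 2 for c in w))
print(magnitude.max() <= 1.0 / np.sqrt(2.0 * (64 + 32)))  # True
```

Because the imaginary quaternion is normalized, the magnitude of every generated weight equals $|\varphi|$, so it never exceeds $\sigma$ regardless of the sampled angle.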

Experiments

This Section details the acoustic features extraction (Section 4.1), the experimental setups and the results obtained with QRNNs, QLSTMs, RNNs and LSTMs on the TIMIT speech recognition tasks (Section 4.2). The results reported in bold on tables are obtained with the best configurations of the neural networks observed with the validation set.

The raw audio is first split every 10 ms with a window of 25 ms. Then 40-dimensional log Mel-filter-bank coefficients with first, second, and third order derivatives are extracted using the pytorch-kaldi toolkit (Ravanelli et al., 2018b; available at https://github.com/mravanelli/pytorch-kaldi) and the Kaldi s5 recipes (Povey et al., 2011). An acoustic quaternion $Q(f,t)$ associated with a frequency $f$ and a time-frame $t$ is formed as follows:

$$Q(f,t) = e(f,t) + \frac{\partial e(f,t)}{\partial t}\textbf{i} + \frac{\partial^{2} e(f,t)}{\partial t^{2}}\textbf{j} + \frac{\partial^{3} e(f,t)}{\partial t^{3}}\textbf{k}.$$

$Q(f,t)$ represents multiple views of a frequency $f$ at time frame $t$, consisting of the energy $e(f,t)$ in the filter band at frequency $f$, its first time derivative describing a slope view, its second time derivative describing a concavity view, and the third derivative describing the rate of change of the second derivative. Quaternions are used to learn the spatial relations that exist between the different views described above that characterize the same frequency (Tokuda et al., 2003). Thus, the quaternion input vector length is $160/4=40$. Decoding is based on Kaldi (Povey et al., 2011) and weighted finite state transducers (WFST) (Mohri et al., 2002) that integrate acoustic, lexicon and language model probabilities into a single HMM-based search graph.
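The grouping of the 160 real-valued features into 40 acoustic quaternions can be sketched as a simple reshape. The function name is hypothetical, and the assumed feature ordering (40 static coefficients, then the 40 first, second and third derivatives as contiguous blocks) is an assumption that depends on the feature extraction pipeline.

```python
import numpy as np

def to_quaternion_input(features, n_bands=40):
    """Group a (T, 4 * n_bands) feature matrix into (T, 4, n_bands) quaternion components.

    Assumed layout of each frame: [static | 1st deriv | 2nd deriv | 3rd deriv].
    Axis 1 of the result indexes the quaternion components r, i, j, k.
    """
    n_frames, n_features = features.shape
    if n_features != 4 * n_bands:
        raise ValueError("expected 4 values per frequency band")
    return features.reshape(n_frames, 4, n_bands)

frames = np.arange(2 * 160, dtype=float).reshape(2, 160)  # 2 dummy frames
Q = to_quaternion_input(frames)
print(Q.shape)      # (2, 4, 40)
print(Q[0, :, 0])   # the four views of the first frequency band of frame 0
```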

4.2 The TIMIT corpus

The training process is based on the standard 3,696 sentences uttered by 462 speakers, while testing is conducted on 192 sentences uttered by 24 speakers of the TIMIT (Garofolo et al., 1993) dataset. A validation set composed of 400 sentences uttered by 50 speakers is used for hyper-parameter tuning. The models are compared on a fixed number of layers $M=4$ and by varying the number of neurons $N$ from 256 to 2,048, and from 64 to 512, for the RNN and QRNN respectively. Indeed, it is worth underlining that the hidden neurons in the quaternion and real spaces do not handle the same amount of real-number values: 256 quaternion neurons output $256\times 4=1024$ real values. Tanh activations are used across all the layers except for the output layer, which is based on a softmax function. Models are optimized with RMSPROP with vanilla hyper-parameters and an initial learning rate of $8\cdot 10^{-4}$. The learning rate is progressively annealed using a halving factor of 0.5 that is applied when no performance improvement on the validation set is observed. The models are trained for 25 epochs. All the models converged to a minimum loss, due to the annealed learning rate. A dropout rate of 0.2 is applied over all the hidden layers (Srivastava et al., 2014) except the output one. The negative log-likelihood loss function is used as an objective function. All the experiments are repeated 5 times (5-folds) with different seeds and are averaged to limit any variation due to the random initialization.
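The learning rate annealing described above can be sketched as a small helper. The function name is hypothetical, and comparing the latest validation error with the previous epoch (rather than with the best value so far) is an assumption, since the text only states that the rate is halved when no improvement is observed.

```python
def anneal_learning_rate(lr, val_errors, factor=0.5):
    """Return the learning rate for the next epoch.

    val_errors holds the validation error of each finished epoch (lower is better).
    The rate is multiplied by `factor` when the last epoch brought no improvement.
    """
    if len(val_errors) >= 2 and val_errors[-1] >= val_errors[-2]:
        return lr * factor
    return lr

lr = 8e-4
history = []
for val_error in [24.0, 21.5, 21.7, 20.9]:   # made-up validation errors
    history.append(val_error)
    lr = anneal_learning_rate(lr, history)
print(lr)  # halved once, after the third epoch
```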

The results on the TIMIT task are reported in Table 1. The best PER in realistic conditions (w.r.t. the best validation PER) is 18.5% and 19.0% on the test set for QRNN and RNN models respectively, highlighting an absolute improvement of 0.5% obtained with QRNN. These results compare favorably with the best results obtained so far with architectures that do not integrate access control in multiple memory layers (Ravanelli et al., 2018a). In the latter, a PER of 18.3% is reported on the TIMIT test set with batch-normalized RNNs. Moreover, a remarkable advantage of QRNNs is a drastic reduction (by a factor of $2.5\times$) of the parameters needed to achieve these results. Indeed, such PERs are obtained with models that employ the same internal dimensionality, corresponding to 1,024 real-valued neurons and 256 quaternion-valued ones, resulting in 3.8M parameters for the QRNN against the 9.4M used in the real-valued RNN. It is also worth noting that QRNNs consistently need fewer parameters than equivalently sized RNNs, with an average reduction factor of 2.26. This is easily explained by considering the content of the quaternion algebra. Indeed, for a fully-connected layer with 2,048 input values and 2,048 hidden units, a real-valued RNN has $2,048^{2}\approx 4.2$M parameters, while to maintain equal input and output dimensions the quaternion equivalent has 512 quaternion inputs and 512 quaternion hidden units. Therefore, the number of parameters for the quaternion-valued model is $512^{2}\times 4\approx 1$M. Such a complexity reduction turns out to produce better results and has other advantages, such as a smaller memory footprint when saving models on memory-constrained systems. This characteristic makes our QRNN model particularly suitable for speech recognition conducted on low computational power devices like smartphones (Chen et al., 2014).
QRNN and RNN accuracies vary with the architecture, with better PER on bigger and wider topologies. Therefore, while good PERs are observed with a higher number of parameters, smaller architectures reached 23.9% and 23.4%, with 1M and 0.6M parameters for the RNN and the QRNN respectively. Such PERs are due to a number of parameters that is too small to solve the task.
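The parameter counts quoted above for a single fully-connected layer can be reproduced with a short calculation. The function names are hypothetical and biases are ignored, as in the figures given in the text.

```python
def real_dense_params(n_in, n_out):
    """Weights of a real-valued fully-connected layer (biases ignored)."""
    return n_in * n_out

def quaternion_dense_params(n_in, n_out):
    """Weights of the quaternion layer with the same real dimensions (biases ignored).

    n_in and n_out are counted in real values; each group of 4 forms one quaternion,
    and each quaternion weight holds 4 real numbers.
    """
    if n_in % 4 or n_out % 4:
        raise ValueError("dimensions must be multiples of 4")
    return (n_in // 4) * (n_out // 4) * 4

print(real_dense_params(2048, 2048))        # 4194304, about 4.2M
print(quaternion_dense_params(2048, 2048))  # 1048576, about 1M
```

The ratio between the two counts is exactly 4 for a single layer. The smaller reduction factors reported for complete models presumably reflect parts of the network that do not shrink by that factor, which the text does not detail.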

4.3 Quaternion long-short term memory neural networks

We propose to extend the QRNN to state-of-the-art models such as long-short term memory neural networks (LSTM), to support and improve the results already observed with the QRNN compared to the RNN in more realistic conditions. LSTM (Hochreiter & Schmidhuber, 1997) neural networks were introduced to solve the problems of long-term dependency learning and vanishing or exploding gradients observed with long sequences. Based on the equations of the forward propagation and backpropagation through time of the QRNN described in Section 3.3.1 and Section 3.3.2, one can easily derive the equations of a quaternion-valued LSTM. Gates are defined with quaternion numbers following the proposal of Danihelka et al. (2016). Therefore, the gate action is characterized by an independent modification of each component of the quaternion-valued signal following a component-wise product with the quaternion-valued gate potential. Let $f_{t}$, $i_{t}$, $o_{t}$, $c_{t}$, and $h_{t}$ be the forget, input, output gates, cell states and the hidden state of an LSTM cell at time-step $t$:

$$\begin{aligned} f_{t} &= \alpha(W_{f}\otimes x_{t} + R_{f}\otimes h_{t-1} + b_{f}), \\ i_{t} &= \alpha(W_{i}\otimes x_{t} + R_{i}\otimes h_{t-1} + b_{i}), \\ c_{t} &= f_{t}\times c_{t-1} + i_{t}\times \tanh(W_{c}\otimes x_{t} + R_{c}\otimes h_{t-1} + b_{c}), \\ o_{t} &= \alpha(W_{o}\otimes x_{t} + R_{o}\otimes h_{t-1} + b_{o}), \\ h_{t} &= o_{t}\times \tanh(c_{t}), \end{aligned}$$

where $W$ are rectangular input weight matrices, $R$ are square recurrent weight matrices, and $b$ are bias vectors. $\alpha$ is the split activation function and $\times$ denotes a component-wise product between two quaternions. Both QLSTM and LSTM are bidirectional and trained under the same conditions as for the QRNN and RNN experiments.
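A single QLSTM step can be sketched as below, using rectangular input matrices $W$, square recurrent matrices $R$, bias vectors $b$ and component-wise gating. The helper names, the `[r | i | j | k]` vector layout, the Gaussian weights, and the use of a split sigmoid for the gates and a split tanh for the cell candidate are assumptions made for this example, not details confirmed by the text.

```python
import numpy as np

def quaternion_matrix(Wr, Wi, Wj, Wk):
    """Real block matrix that applies W (Hamilton product) to a [r | i | j | k] vector."""
    return np.block([
        [Wr, -Wi, -Wj, -Wk],
        [Wi,  Wr, -Wk,  Wj],
        [Wj,  Wk,  Wr, -Wi],
        [Wk, -Wj,  Wi,  Wr],
    ])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_gate(rng, n_hid, n_in, scale=0.1):
    """(W, R, b) for one gate, with sizes counted in quaternions."""
    W = quaternion_matrix(*(scale * rng.standard_normal((4, n_hid, n_in))))
    R = quaternion_matrix(*(scale * rng.standard_normal((4, n_hid, n_hid))))
    return W, R, np.zeros(4 * n_hid)

def qlstm_step(x_t, h_prev, c_prev, gates):
    """One QLSTM step; gating is a component-wise product on real components."""
    def affine(name):
        W, R, b = gates[name]
        return W @ x_t + R @ h_prev + b

    f_t = sigmoid(affine("f"))                       # forget gate
    i_t = sigmoid(affine("i"))                       # input gate
    o_t = sigmoid(affine("o"))                       # output gate
    c_t = f_t * c_prev + i_t * np.tanh(affine("c"))  # cell state
    h_t = o_t * np.tanh(c_t)                         # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 40, 8                                  # sizes counted in quaternions
gates = {name: make_gate(rng, n_hid, n_in) for name in "fioc"}
h = c = np.zeros(4 * n_hid)
for x_t in rng.standard_normal((5, 4 * n_in)):
    h, c = qlstm_step(x_t, h, c, gates)
print(h.shape)  # (32,)
```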

The results on the TIMIT corpus reported in Table 2 support the initial intuitions and the previously established trends. We first point out that the best PER observed is 15.1% and 15.3% on the test set for QLSTM and LSTM models respectively, with an absolute improvement of 0.2% obtained with QLSTM using 3.3 times fewer parameters compared to LSTM. These results are among the top of the line results (Graves et al., 2013b; Ravanelli et al., 2018a) and show that the proposed quaternion approach can be used in state-of-the-art models. A deeper investigation of QLSTM performance with the larger Wall Street Journal (WSJ) dataset can be found in Appendix 6.1.1.

Conclusion

Summary. This paper proposes to process sequences of multidimensional features (such as acoustic data) with a novel quaternion recurrent neural network (QRNN) and quaternion long-short term memory neural network (QLSTM). The experiments conducted on the TIMIT phoneme recognition task show that QRNNs and QLSTMs are more effective at learning a compact representation of multidimensional information, outperforming RNNs and LSTMs with 2 to 3 times fewer free parameters. Therefore, our initial intuition that the quaternion algebra offers a better and more compact representation for multidimensional features, alongside a better learning capability of feature internal dependencies through the Hamilton product, has been demonstrated.

Future Work. Future investigations will develop other multi-view features that contribute to decreasing ambiguities in representing phonemes in the quaternion space. To this end, a recent approach based on a quaternion Fourier transform to create quaternion-valued signals has to be investigated. Finally, other high-dimensional neural networks such as manifold and Clifford networks remain mostly unexplored and can benefit from further research.

References

Appendix

This Section proposes to validate the scaling of the proposed QLSTMs to a bigger and more realistic corpus, with a speech recognition task on the Wall Street Journal (WSJ) dataset. Finally, it discusses the impact of the quaternion algebra in terms of computational complexity.

We propose to evaluate both QLSTMs and LSTMs with a larger and more realistic corpus to validate the scaling of the observed TIMIT results (Section 4.2). Acoustic input features are described in Section 4.1, and extracted on both the 14 hour subset ‘train-si84’ and the full 81 hour dataset ‘train-si284’ of the Wall Street Journal (WSJ) corpus. The ‘test-dev93’ development set is employed for validation, while ‘test-eval92’ composes the testing set. Model architectures are fixed with respect to the best results observed with the TIMIT corpus (Section 4.2). Therefore, both QLSTMs and LSTMs contain four bidirectional layers of internal dimension 1,024. Then, an additional layer of internal size 1,024 is added before the output layer. The only change in the training procedure compared to the TIMIT experiments concerns the model optimizer, which is set to Adam (Kingma & Ba, 2014) instead of RMSPROP. Results are from a 3-fold average.

It is important to notice that the results reported in Table 3 compare favorably with equivalent architectures (Graves et al., 2013a) (WER of 11.7% on ‘test-dev93’), and are competitive with state-of-the-art and much more complex models based on better engineered features (Chan & Lane, 2015) (WER of 3.8% with the 81 hours of training data, on ‘test-eval92’). According to Table 3, QLSTMs outperform LSTMs in all the training conditions (14 hours and 81 hours) and with respect to both the validation and testing sets. Moreover, QLSTMs still need 2.9 times fewer neural parameters than LSTMs to achieve such performance. This experiment demonstrates that QLSTMs scale well to larger and more realistic speech datasets and are still more efficient than real-valued LSTMs.

6.1.2 Notes on computational complexity

A computational complexity of $O(n^{2})$, with $n$ the number of hidden states, has been reported by Morchid (2018) for real-valued LSTMs. QLSTMs just involve 4 times larger matrices during computations. Therefore, the computational complexity remains unchanged and equal to $O(n^{2})$. Nonetheless, due to the Hamilton product, a single forward propagation between two quaternion neurons uses 28 operations, compared to a single one for two real-valued neurons, implying a longer training time (up to 3 times slower). However, this slower speed could easily be alleviated with a properly engineered cuDNN kernel for the Hamilton product, which would help QNNs to be more efficient than real-valued ones. A well-adapted CUDA kernel would allow QNNs to perform more computations with fewer parameters, and therefore fewer memory copy operations from the CPU to the GPU.
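The figure of 28 operations is consistent with counting the real multiplications and the real additions or subtractions of one Hamilton product. The sketch below instruments the product with a small counting wrapper to check this. The breakdown into 16 multiplications and 12 additions is our reading of the number and is not stated in the text; the class and function names are hypothetical.

```python
class Counted:
    """Wraps a real number and counts the arithmetic performed on it."""
    ops = {"mul": 0, "add": 0}

    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        Counted.ops["mul"] += 1
        return Counted(self.value * other.value)

    def __add__(self, other):
        Counted.ops["add"] += 1
        return Counted(self.value + other.value)

    def __sub__(self, other):
        Counted.ops["add"] += 1  # a subtraction is counted as one addition
        return Counted(self.value - other.value)

def hamilton_product(q1, q2):
    """Hamilton product written with explicit real operations."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return (
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,
    )

def count_hamilton_ops():
    """Reset the counters, run one product, and return the operation counts."""
    Counted.ops = {"mul": 0, "add": 0}
    q1 = [Counted(v) for v in (1.0, 2.0, 3.0, 4.0)]
    q2 = [Counted(v) for v in (5.0, 6.0, 7.0, 8.0)]
    hamilton_product(q1, q2)
    return dict(Counted.ops)

ops = count_hamilton_ops()
print(ops, sum(ops.values()))  # {'mul': 16, 'add': 12} 28
```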

6.2 Parameters initialization

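Let us recall that a generated quaternion weight $w$ from a weight matrix $W$ has a polar form defined as:

$$w = |w|e^{q_{imag}^{\triangleleft}\theta} = |w|\left(\cos(\theta) + q_{imag}^{\triangleleft}\sin(\theta)\right),$$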

with qimag=0+xi+yj+zkq_{imag}^{\triangleleft}=0+x\textbf{i}+y\textbf{j}+z\textbf{k} a purely imaginary and normalized quaternion. Therefore, ww can be computed following:

However, $\varphi$ represents a randomly generated variable with respect to the variance of the quaternion weight and the selected initialization criterion. The initialization process follows (Glorot & Bengio, 2010) and (He et al., 2015) to derive the variance of the quaternion-valued weight parameters. Indeed, the variance of $W$ has to be investigated:

where $f(x)$ is the probability density function with four DOFs. A four-dimensional vector $X=\{A,B,C,D\}$ is considered to evaluate the density function $f(x)$. $X$ has components that are normally distributed, centered at zero, and independent. Then, $A$, $B$, $C$ and $D$ have density functions:

$$f_{A}(x;\sigma) = f_{B}(x;\sigma) = f_{C}(x;\sigma) = f_{D}(x;\sigma) = \frac{e^{-x^{2}/2\sigma^{2}}}{\sqrt{2\pi\sigma^{2}}}.$$

The four-dimensional vector $X$ has a length $L$ defined as $L=\sqrt{A^{2}+B^{2}+C^{2}+D^{2}}$ with a cumulative distribution function $F_{L}(x;\sigma)$ in the 4-sphere (n-sphere with $n=4$) $S_{x}$:

Therefore, by the Jacobian $J_{f}$, we have the polar form:

Then, writing Eq.(55) in polar coordinates, we obtain:

The probability density function for $X$ is the derivative of its cumulative distribution function, which by the fundamental theorem of calculus is:

The expectation of the squared magnitude becomes:

Based on L'Hôpital's rule, the undetermined limit becomes:

The limit of the first term is obtained with the same method as in Eq. (61). Therefore, the expectation is:

$$\mathbb{E}(|W|^{2}) = 4\sigma^{2}.$$
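As a numerical sanity check of the expectation of the squared magnitude, the value $4\sigma^{2}$ expected for four independent, zero-mean normal components of standard deviation $\sigma$ can be compared with a Monte Carlo estimate. The function name, the sample size and the seed are arbitrary choices made for this sketch.

```python
import numpy as np

def mean_squared_magnitude(sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[A^2 + B^2 + C^2 + D^2] for N(0, sigma^2) components."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, sigma, size=(n_samples, 4))
    return np.mean(np.sum(X ** 2, axis=1))

sigma = 0.5
estimate = mean_squared_magnitude(sigma)
print(estimate, 4 * sigma ** 2)  # the estimate should be close to 1.0
```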

6.3 Quaternion backpropagation through time

Let us recall the forward equations and parameters needed to derive the complete quaternion backpropagation through time (QBPTT) algorithm.

Let $x_{t}$ be the input vector at timestep $t$, $h_{t}$ the hidden state, and $W_{hh}$, $W_{hx}$ and $W_{hy}$ the hidden state, input and output weight matrices respectively. Finally, $b_{h}$ is the bias vector of the hidden states and $p_{t}$, $y_{t}$ are the output and the expected target vectors. The hidden state is computed as:

$$h_{t} = \alpha(W_{hh}\otimes h_{t-1} + W_{hx}\otimes x_{t} + b_{h}),$$

and $\alpha$ is the quaternion split activation function (Xu et al., 2017) of a quaternion $Q$ defined as:

$$\alpha(Q) = f(r) + f(x)\textbf{i} + f(y)\textbf{j} + f(z)\textbf{k},$$

with $f$ corresponding to any standard activation function. The output vector $p_{t}$ can be computed as:

$$p_{t} = \beta(W_{hy}\otimes h_{t}),$$

with $\beta$ any split activation function. Finally, the objective function is a real-valued loss function applied component-wise. The gradient with respect to the MSE loss is expressed for each weight matrix as $\frac{\partial E_{t}}{\partial W_{hy}}$, $\frac{\partial E_{t}}{\partial W_{hh}}$, $\frac{\partial E_{t}}{\partial W_{hx}}$, and for the bias vector as $\frac{\partial E_{t}}{\partial B_{h}}$. In the real-valued space, the dynamic of the loss is only investigated based on all previously connected neurons. In this respect, the QBPTT differs from BPTT because the loss must also be derived with respect to each component of a quaternion neural parameter, making it bi-level. This could act as a regularizer during the training process.

6.3.2 Output weight matrix

The weight matrix $W_{hy}$ is used only in the computation of $p_{t}$. It is therefore straightforward to compute $\frac{\partial E_{t}}{\partial W_{hy}}$:

Each quaternion component is then derived following the chain rule:

By regrouping in a matrix form the $h_{t}$ components from these equations, one can define:

6.3.3 Hidden weight matrix

Contrary to $W_{hy}$, the weight matrix $W_{hh}$ is an argument of $h_{t}$ with $h_{t-1}$ involved. The recursive backpropagation can thus be derived as:

with $N$ the number of timesteps that compose the sequence. As for $W_{hy}$, we start with $\frac{\partial E_{k}}{\partial W_{hh}^{r}}$:

Non-recursive elements are derived w.r.t. r, i, j, k:

The remaining terms $\frac{\partial h_{t}^{r}}{\partial h_{m}^{r}}$, $\frac{\partial h_{t}^{i}}{\partial h_{m}^{i}}$, $\frac{\partial h_{t}^{j}}{\partial h_{m}^{j}}$ and $\frac{\partial h_{t}^{k}}{\partial h_{m}^{k}}$ are recursive and are written as:

The same operations are performed for i, j, k in Eq. 93 and $\frac{\partial E_{t}}{\partial W_{hh}}$ can finally be expressed as:

6.3.4 Input weight matrix

$\frac{\partial E_{t}}{\partial W_{hx}}$ is computed in the exact same manner as $\frac{\partial E_{t}}{\partial W_{hh}}$.

Therefore, $\frac{\partial E_{t}}{\partial W_{hx}}$ is easily extended as:

6.3.5 Hidden biases

$\frac{\partial E_{t}}{\partial B_{h}}$ can easily be extended to:

Nonetheless, since biases are not connected to any inputs or hidden states, the matrix of derivatives defined in Eq. 84 becomes a matrix of ones. Consequently, $\frac{\partial E_{t}}{\partial B_{h}}$ can be summarized as: