Deep Quaternion Networks

Chase Gaudet, Anthony Maida

I Introduction

There have been many advances in deep neural network architectures in the past few years. One such improvement is batch normalization, a technique that standardizes the activations of layers inside a network using minibatch statistics. It has been shown to regularize the network as well as provide faster and more stable training. Another improvement comes from architectures that add so-called shortcut paths to the network. These shortcut paths typically connect later layers to earlier layers, which allows stronger gradients to propagate to the earlier layers. This method can be seen in Highway Networks and Residual Networks. Other work has sought new activation functions with more desirable properties. One example is the exponential linear unit (ELU), which attempts to keep activations standardized. All of the above methods combat the vanishing gradient problem that plagues deep architectures. With solutions to this problem appearing, it is only natural to move to a system that will allow one to construct deeper architectures with as low a parameter cost as possible.

Other work in this area has explored the use of complex numbers and of hyper-complex numbers, such as quaternions, which generalize the complex numbers. Using complex numbers in recurrent neural networks (RNNs) has been shown to increase learning speed and provide a more noise-robust memory retrieval mechanism. The first formulation of complex batch normalization and complex weight initialization achieved some state-of-the-art results on the MusicNet data set. Hyper-complex numbers are less explored in neural networks, but have seen use in manual image and signal processing techniques. Examples of using quaternion values in networks are mostly limited to architectures that take in quaternion inputs or predict quaternion outputs, but do not have quaternion weight values. There are some more recent examples of building models that use quaternions represented as real values. One uses a quaternion multi-layer perceptron (QMLP) for document understanding, and another uses a similar approach for processing multi-dimensional signals.

Building on this work, our contribution in this paper is to formulate and implement quaternion convolution, batch normalization, and weight initialization. Source code is located at https://github.com/gaudetcj/DeepQuaternionNetworks. Quaternion batch normalization presents a difficulty beyond the complex case that we had to overcome, as there is no analytic form for our inverse square root matrix.

II Motivation and Related Work

The ability of quaternions to effectively represent spatial transformations and analyze multi-dimensional signals makes them promising for applications in artificial intelligence.

One common use of quaternions is to represent rotations in a more compact form. PoseNet used a quaternion as the target output in its model, where the goal was to recover the 6-DOF camera pose from a single RGB image. The ability to encode rotations may make a quaternion network more robust to rotational variance.

Quaternion representation has also been used in signal processing. Oppenheim and Lim showed that the information in the phase of an image is sufficient to recover the majority of the information encoded in its magnitude. The phase also encodes information such as shapes, edges, and orientations. Quaternions can be represented as a $2\times 2$ matrix of complex numbers, which gives them a group of phases potentially holding more information compared to a single phase.

Bulow and Sommer exploited the richer representation of quaternions by extending Gabor’s complex signal to a quaternion one, which was then used for texture segmentation. Another use of quaternion filters introduces a new class of filter based on convolution with hyper-complex masks and presents three color edge detecting filters. These filters rely on a three-space rotation about the grey line of RGB space and, when applied to a color image, produce an almost greyscale image with color edges where the original image had a sharp change of color. Quaternion filters have also been shown to be effective for segmenting color images into regions of similar color texture. The stated advantage of quaternion arithmetic is that a color can be represented and analyzed as a single entity (by assigning each color channel to an imaginary axis), which, as we will see in Section III-C, holds for quaternion convolution in a convolutional neural network architecture as well.

A quaternionic extension of a feed forward neural network, for processing multi-dimensional signals, has also been proposed. Its authors expect quaternion neurons to operate on multi-dimensional signals as single entities, unlike real-valued neurons, which deal with each element of a signal independently. A convolutional neural network (CNN) should be able to learn a powerful set of quaternion filters for more impressive tasks.

Another large motivation, discussed in prior work on complex-valued networks, is that complex numbers are more efficient and provide more robust memory mechanisms compared to the reals. The same work argues that residual networks have a similar architecture to associative memories, since the residual shortcut paths compute their residual and then sum it into the memory provided by the identity connection. Again, given that quaternions can be represented as a complex group, they may provide an even more efficient and robust memory mechanism.

III Quaternion Network Components

This section describes the components needed to obtain a working deep quaternion network. Some of the longer derivations are given in the Appendix.

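A quaternion $Q$ is written as

$$Q = a + b\,i + c\,j + d\,k, \qquad a, b, c, d \in \mathbb{R},$$

where $a$ is the real part, $(i,j,k)$ denotes the three imaginary axes, and $(b,c,d)$ denotes the three imaginary components. Quaternions are governed by the following arithmetic:

$$i^{2}=j^{2}=k^{2}=ijk=-1,$$

which, by enforcing distributivity, leads to the noncommutative multiplication rules

$$ij=k,\quad jk=i,\quad ki=j,\quad ji=-k,\quad kj=-i,\quad ik=-j.$$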
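
The multiplication rules of this section can be checked with a few lines of code. The following is a minimal sketch, not taken from the released source code: the function name `hamilton_product` and the tuple layout `(a, b, c, d)` for $a + bi + cj + dk$ are illustrative assumptions.

```python
def hamilton_product(p, q):
    """Multiply two quaternions stored as (a, b, c, d) = a + b*i + c*j + d*k.

    The tuple layout is an assumption made for this sketch.
    """
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    )


# Basis elements, used to confirm the rules ij = k and ji = -k.
I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

ij = hamilton_product(I, J)  # equals K
ji = hamilton_product(J, I)  # equals -K, so multiplication is noncommutative
```

Swapping the operands flips the sign of the result for distinct imaginary units, which is the noncommutativity noted above.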

With our real-valued representation, a quaternion $2D$ convolution layer can be expressed as follows. Say that the layer has $N$ feature maps such that $N$ is divisible by 4. We let the first $N/4$ feature maps represent the real components, the second $N/4$ represent the $i$ imaginary components, the third $N/4$ represent the $j$ imaginary components, and the last $N/4$ represent the $k$ imaginary components.
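
The channel layout described above can be sketched as follows. This is only an illustration of the indexing; the function name `split_quaternion_feature_maps` and the use of a plain list of feature maps are assumptions, not the released implementation.

```python
def split_quaternion_feature_maps(feature_maps):
    """Split N feature maps into the four quaternion component groups.

    The first N/4 maps are the real parts, followed by the i, j, and k parts.
    `feature_maps` is any sequence (hypothetical container for this sketch).
    """
    n = len(feature_maps)
    if n % 4 != 0:
        raise ValueError("the number of feature maps must be divisible by 4")
    quarter = n // 4
    return {
        "r": feature_maps[0 * quarter:1 * quarter],
        "i": feature_maps[1 * quarter:2 * quarter],
        "j": feature_maps[2 * quarter:3 * quarter],
        "k": feature_maps[3 * quarter:4 * quarter],
    }


# With N = 8, maps 0-1 are real, 2-3 are i, 4-5 are j, and 6-7 are k.
groups = split_quaternion_feature_maps(list(range(8)))
```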

III-B Quaternion Differentiability

For the network to perform backpropagation, the cost function and activation functions must be differentiable with respect to the real, $i$, $j$, and $k$ components of each quaternion parameter of the network. Just as a chain rule exists for the complex case, we provide the quaternion chain rule, which is given in Appendix VII-A.

III-C Quaternion Convolution

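Convolution in the quaternion domain is done by convolving a quaternion filter matrix $\textbf{W}=\textbf{A}+\textit{i}~{}\textbf{B}+\textit{j}~{}\textbf{C}+\textit{k}~{}\textbf{D}$ with a quaternion vector $\textbf{h}=\textbf{w}+\textit{i}~{}\textbf{x}+\textit{j}~{}\textbf{y}+\textit{k}~{}\textbf{z}$. Here A, B, C, and D are real-valued matrices and w, x, y, and z are real-valued vectors. Performing the convolution by using the distributive property and grouping terms one gets

$$\begin{aligned}\textbf{W}*\textbf{h} = {} & (\textbf{A}*\textbf{w}-\textbf{B}*\textbf{x}-\textbf{C}*\textbf{y}-\textbf{D}*\textbf{z}) \\ & + i\,(\textbf{A}*\textbf{x}+\textbf{B}*\textbf{w}+\textbf{C}*\textbf{z}-\textbf{D}*\textbf{y}) \\ & + j\,(\textbf{A}*\textbf{y}-\textbf{B}*\textbf{z}+\textbf{C}*\textbf{w}+\textbf{D}*\textbf{x}) \\ & + k\,(\textbf{A}*\textbf{z}+\textbf{B}*\textbf{y}-\textbf{C}*\textbf{x}+\textbf{D}*\textbf{w}).\end{aligned}$$

Using a matrix to represent the components of the convolution we have:

$$\begin{bmatrix}\mathcal{R}(\textbf{W}*\textbf{h})\\ \mathcal{I}(\textbf{W}*\textbf{h})\\ \mathcal{J}(\textbf{W}*\textbf{h})\\ \mathcal{K}(\textbf{W}*\textbf{h})\end{bmatrix}=\begin{bmatrix}\textbf{A} & -\textbf{B} & -\textbf{C} & -\textbf{D}\\ \textbf{B} & \textbf{A} & -\textbf{D} & \textbf{C}\\ \textbf{C} & \textbf{D} & \textbf{A} & -\textbf{B}\\ \textbf{D} & -\textbf{C} & \textbf{B} & \textbf{A}\end{bmatrix}*\begin{bmatrix}\textbf{w}\\ \textbf{x}\\ \textbf{y}\\ \textbf{z}\end{bmatrix}$$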
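
To make the grouping of terms concrete, the following sketch builds a quaternion convolution from sixteen real-valued convolutions, one per entry of the real matrix representation. It is a one-dimensional, pure-Python illustration under stated assumptions: the helper names are hypothetical, and `conv1d` is a "valid" sliding dot product (the cross-correlation that most CNN libraries call convolution), not the released 2D implementation.

```python
def conv1d(kernel, signal):
    """Valid-mode sliding dot product of a real kernel over a real signal."""
    steps = len(signal) - len(kernel) + 1
    return [sum(kv * signal[t + s] for s, kv in enumerate(kernel))
            for t in range(steps)]


def add(*seqs):
    """Element-wise sum of equally long sequences."""
    return [sum(vals) for vals in zip(*seqs)]


def neg(seq):
    """Element-wise negation."""
    return [-v for v in seq]


def quaternion_conv1d(W, h):
    """Quaternion convolution W * h built from real convolutions.

    W = (A, B, C, D) holds the real, i, j, k kernels and
    h = (w, x, y, z) holds the real, i, j, k signals.
    Each output row follows one row of the real matrix representation.
    """
    A, B, C, D = W
    w, x, y, z = h
    c = conv1d
    real = add(c(A, w), neg(c(B, x)), neg(c(C, y)), neg(c(D, z)))
    i_part = add(c(B, w), c(A, x), neg(c(D, y)), c(C, z))
    j_part = add(c(C, w), c(D, x), c(A, y), neg(c(B, z)))
    k_part = add(c(D, w), neg(c(C, x)), c(B, y), c(A, z))
    return real, i_part, j_part, k_part


# With length-1 kernels and signals the result is the plain quaternion product
# (1 + 2i + 3j + 4k)(5 + 6i + 7j + 8k).
out = quaternion_conv1d(([1], [2], [3], [4]), ([5], [6], [7], [8]))
```

Every output component depends on all four kernel components and all four input components, which is the cross-axis interaction discussed in this section.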

An example is shown in Fig. 1, which is useful to visualize one of the main motivational factors of quaternions for CNNs. Notice that each axis of the quaternion convolution output is a distinct linear combination of all four axes of the input. This comes from the structure of quaternion multiplication and forces each axis of the kernel to interact with each axis of the image. Real-valued convolution simply multiplies each channel of the kernel with the corresponding channel of the image. The quaternion convolution is similar to a mixture of standard convolution and depthwise separable convolution. In depthwise separable convolution, a flat convolution kernel (no depth to match the depth of the feature image) is first applied separately to each feature map. This gives spatial context only on each feature map individually. Then a $1\times 1$ convolution is applied to the results of the previous operation to get a linear interaction of the feature maps, projecting them into a new feature map space.

The quaternion network’s reuse of filters on every axis and combination may help extract texture information across channels. One can think in terms of an RGB image where the greyscale of the image is the real axis and the individual RGB channels are the $i,j,k$ axes. Then a quaternion kernel convolved against this quaternion image will view the colors as a single entity, unlike standard real-valued convolution. Since a quaternion can be thought of as a vector, the quaternion kernels and feature maps can be thought of as vectors as well.
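
The color encoding just described can be sketched per pixel as follows. The function name `rgb_to_quaternion` is hypothetical, and taking the greyscale value as the unweighted mean of the three channels is an assumption of this sketch; the text does not specify which greyscale conversion is meant.

```python
def rgb_to_quaternion(red, green, blue):
    """Encode one RGB pixel as a quaternion (real, i, j, k).

    The real axis holds a greyscale value; the i, j, k axes hold R, G, B.
    The unweighted mean used for greyscale is an assumption of this sketch.
    """
    grey = (red + green + blue) / 3.0
    return (grey, red, green, blue)


# A single pixel with channel intensities in [0, 1].
pixel = rgb_to_quaternion(0.2, 0.4, 0.9)
```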

III-D Quaternion Batch-Normalization

Batch-normalization is used by the vast majority of deep networks to stabilize and speed up training. It works by keeping the activations of the network at zero mean and unit variance. The original formulation of batch-normalization only works for real values. Applying batch normalization to complex or hyper-complex numbers is more difficult: one cannot simply translate and scale them such that their mean is 0 and their variance is 1, because this would not give equal variance in the multiple components of a complex or hyper-complex number. To overcome this for complex numbers a whitening approach is used, which scales the data by the square root of their variances along each of the two principal components. We use the same approach, but must whiten 4D vectors. The whitened data is given by

$$\tilde{\textbf{x}}=\textbf{W}(\textbf{x}-\mathrm{E}[\textbf{x}]),$$

where W is one of the matrices from the Cholesky decomposition of $\textbf{V}^{-1}$ and V is the covariance matrix given by:

$$\textbf{V}=\begin{bmatrix}V_{rr} & V_{ri} & V_{rj} & V_{rk}\\ V_{ir} & V_{ii} & V_{ij} & V_{ik}\\ V_{jr} & V_{ji} & V_{jj} & V_{jk}\\ V_{kr} & V_{ki} & V_{kj} & V_{kk}\end{bmatrix},$$

with $V_{ab}$ the covariance between components $a$ and $b$ of $\textbf{x}$, where $a,b\in\{r,i,j,k\}$ index the real, $i$, $j$, and $k$ components.

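Real-valued batch normalization also uses two learned parameters, $\beta$ and $\gamma$. Our shift parameter $\beta$ must shift a quaternion value, so it is a quaternion value itself with real, $i$, $j$, and $k$ as learnable components. The scaling parameter $\gamma$ is a symmetric matrix of size matching V, and therefore has ten independent learnable entries:

$$\gamma=\begin{bmatrix}\gamma_{rr} & \gamma_{ri} & \gamma_{rj} & \gamma_{rk}\\ \gamma_{ri} & \gamma_{ii} & \gamma_{ij} & \gamma_{ik}\\ \gamma_{rj} & \gamma_{ij} & \gamma_{jj} & \gamma_{jk}\\ \gamma_{rk} & \gamma_{ik} & \gamma_{jk} & \gamma_{kk}\end{bmatrix}.$$

As in the real-valued case, the normalized output is the scaled and shifted whitened value, $\gamma\tilde{\textbf{x}}+\beta$.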
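
The whitening, scaling, and shifting steps of this section can be sketched on a batch of single quaternions. This is a minimal pure-Python illustration, not the released implementation. All function names are hypothetical, and the `eps` stabilizer is an assumption. The sketch also swaps in a slightly different but equally valid whitening matrix: it factors $\textbf{V}=\textbf{L}\textbf{L}^{T}$ and uses $\textbf{W}=\textbf{L}^{-1}$ (applied by forward substitution), which satisfies $\textbf{W}^{T}\textbf{W}=\textbf{V}^{-1}$ without explicitly inverting V.

```python
import math


def mean_and_covariance(batch):
    """Per-component mean and (biased) covariance of a batch of vectors."""
    n, d = len(batch), len(batch[0])
    mean = [sum(x[c] for x in batch) / n for c in range(d)]
    cov = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in batch) / n
            for b in range(d)] for a in range(d)]
    return mean, cov


def cholesky_lower(V):
    """Lower-triangular L with V = L L^T (V symmetric positive definite)."""
    d = len(V)
    L = [[0.0] * d for _ in range(d)]
    for row in range(d):
        for col in range(row + 1):
            s = sum(L[row][j] * L[col][j] for j in range(col))
            if row == col:
                L[row][col] = math.sqrt(V[row][row] - s)
            else:
                L[row][col] = (V[row][col] - s) / L[col][col]
    return L


def forward_substitute(L, b):
    """Solve L z = b for z, i.e. apply W = L^{-1} to b."""
    z = []
    for row in range(len(b)):
        partial = sum(L[row][j] * z[j] for j in range(row))
        z.append((b[row] - partial) / L[row][row])
    return z


def quaternion_batch_norm(batch, gamma, beta, eps=1e-5):
    """Whiten each 4-vector in the batch, then apply gamma and beta.

    `eps` is added to the diagonal of V for stability (an assumption).
    """
    mean, V = mean_and_covariance(batch)
    for a in range(4):
        V[a][a] += eps
    L = cholesky_lower(V)
    out = []
    for x in batch:
        white = forward_substitute(L, [x[c] - mean[c] for c in range(4)])
        out.append([sum(gamma[r][c] * white[c] for c in range(4)) + beta[r]
                    for r in range(4)])
    return out
```

With `gamma` set to the identity and `beta` to zero, the output batch has zero mean and (up to `eps`) identity covariance, so all four components have equal variance and are decorrelated.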

III-E Quaternion Weight Initialization

The proper initialization of weights is vital to convergence of deep networks. In this work we derive our quaternion weight initialization using the same procedure as Glorot and Bengio and He et al.

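To begin, we find the variance of a quaternion weight, which we write in polar form:

$$W=|W|e^{(\cos\phi_{1}i+\cos\phi_{2}j+\cos\phi_{3}k)\theta}=|W|\left(\cos\theta+(\cos\phi_{1}i+\cos\phi_{2}j+\cos\phi_{3}k)\sin\theta\right),\tag{10}$$

where $|W|$ is the magnitude, $\theta$ and $\phi$ are angle arguments, and $\mbox{cos}^{2}\phi_{1}+\mbox{cos}^{2}\phi_{2}+\mbox{cos}^{2}\phi_{3}=1$ .

The variance is $\mbox{Var}(W)=\mathrm{E}[|W|^{2}]-(\mathrm{E}[W])^{2}$. Taking $W$ to be symmetrically distributed around 0, so that $\mathrm{E}[W]=0$, this reduces to

$$\mbox{Var}(W)=\mathrm{E}[|W|^{2}]=\int_{0}^{\infty}x^{2}f(x)\,dx=4\sigma^{2},\tag{13}$$

where $f(x)$ is the four DOF distribution given in the Appendix.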

To follow the Glorot and Bengio initialization we have $\mbox{Var}(W)=2/(n_{in}+n_{out})$, where $n_{in}$ and $n_{out}$ are the number of input and output units respectively. Setting this equal to the variance $4\sigma^{2}$ from (13) and solving for $\sigma$ gives $\sigma=1/\sqrt{2(n_{in}+n_{out})}$. To follow the He et al. initialization, which is specialized for rectified linear units (ReLUs), we have $\mbox{Var}(W)=2/n_{in}$, which, again setting equal to (13) and solving for $\sigma$, gives $\sigma=1/\sqrt{2n_{in}}$.
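
The two values of $\sigma$ above can be computed directly. The following is a small sketch; the function name `quaternion_sigma` and the `criterion` argument are illustrative assumptions rather than names from the released code.

```python
import math


def quaternion_sigma(n_in, n_out=None, criterion="glorot"):
    """Per-component sigma such that Var(W) = 4 * sigma**2 meets the criterion.

    "glorot": 4 * sigma**2 = 2 / (n_in + n_out)
    "he":     4 * sigma**2 = 2 / n_in
    """
    if criterion == "glorot":
        if n_out is None:
            raise ValueError("the Glorot criterion needs n_out")
        return 1.0 / math.sqrt(2.0 * (n_in + n_out))
    if criterion == "he":
        return 1.0 / math.sqrt(2.0 * n_in)
    raise ValueError("criterion must be 'glorot' or 'he'")


# Example fan sizes chosen only for illustration.
sigma_glorot = quaternion_sigma(64, 64, criterion="glorot")
sigma_he = quaternion_sigma(32, criterion="he")
```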

As shown in (10) the weight has components $|W|$, $\theta$, and $\phi$. We can initialize the magnitude $|W|$ using our four DOF distribution defined with the appropriate $\sigma$ based on which initialization scheme we are following. The angle components are initialized using the uniform distribution between $-\pi$ and $\pi$, while ensuring that the constraint on the $\phi$ angles is satisfied.
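
One way to realize this initialization for a single weight is sketched below. It is not the released implementation, and two choices are assumptions of the sketch: the magnitude is drawn as the length of four independent zero-mean normal values with standard deviation $\sigma$ (which follows the four DOF distribution), and the constraint on the $\phi$ angles is enforced by normalizing a random 3-vector to obtain the direction cosines. The function name is hypothetical.

```python
import math
import random


def sample_quaternion_weight(sigma, rng=random):
    """Draw one quaternion weight (real, i, j, k) from its polar form.

    `rng` is the `random` module or a `random.Random` instance.
    """
    # Magnitude |W|: length of a 4D vector of independent N(0, sigma^2) values.
    magnitude = math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(4)))
    # Angle theta: uniform between -pi and pi.
    theta = rng.uniform(-math.pi, math.pi)
    # Direction cosines (cos phi_1, cos phi_2, cos phi_3): a random unit
    # 3-vector, so their squares sum to 1 (assumed sampling scheme).
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in v))
    u = [c / norm for c in v]
    s = math.sin(theta)
    return (magnitude * math.cos(theta),
            magnitude * u[0] * s,
            magnitude * u[1] * s,
            magnitude * u[2] * s)


# Example with the He-style sigma for n_in = 32, i.e. 1 / sqrt(2 * 32).
weight = sample_quaternion_weight(0.125, random.Random(0))
```

Because the angular part has unit norm, the squared magnitude of the sampled weight equals $|W|^{2}$, whose expected value is $4\sigma^{2}$.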

IV Experimental Results

Our experiments covered image classification using both the CIFAR-10 and CIFAR-100 benchmarks and image segmentation using the KITTI Road Estimation benchmark. The CIFAR datasets are $32\times 32$ color images of 10 and 100 classes respectively. Each image contains only one class and labels are provided. The KITTI dataset consists of large color images of varying sizes depicting roads as seen from a driver’s perspective. Each image has a corresponding label image in which each pixel is an integer value indicating road or not road. We chose CIFAR because it is an extremely common benchmark task, making it a good sanity check. The KITTI dataset was chosen because it is a fairly common color segmentation benchmark and has binary classes, which made it a simple test. All training was done on a single Nvidia 980Ti.

We use the same architecture as the large model from prior work, which is a 110-layer residual model. There is one difference between the real-valued network and the ones used for both the complex and hyper-complex valued networks. Because the datasets are all real-valued, the network must learn the imaginary or quaternion components. We follow an existing technique in which an additional residual block immediately after the input learns the hyper-complex components.

One of these blocks exists per imaginary component, and their outputs are concatenated with the original input image. Another possible choice when using color images is to use the greyscale image as the real axis and the red, green, and blue channels as the $i$, $j$, and $k$ axes respectively. With this choice it is not necessary to use the above block after the input to learn the imaginary components.

To maintain the same parameter budget among the three network types we divided the number of filters per layer of the real network by a factor of two for the complex, and by a factor of four for the quaternion.

The architecture for all models consists of 3 stages of repeating residual blocks,

where at the end of each stage the images are downsized by a strided convolution. For these classification experiments we ran a shallow network where the stages contained 2, 1, and 1 residual blocks respectively and a deep network where the stages contained 10, 9, and 9 residual blocks respectively. Each stage also doubles the previous stage’s number of convolution kernels. For example, the real model has 32 kernels in the first stage, 64 in the second, and finally 128 in the last. The last two layers are a global average pooling layer followed by a single fully connected layer with a softmax function used to classify the input as one of the 10 classes in CIFAR-10 or one of the 100 classes in CIFAR-100.

We also followed the same training procedure as the prior work, using the backpropagation algorithm with Stochastic Gradient Descent with Nesterov momentum set at 0.9. The norm of the gradients was clipped to 1 and a custom learning rate scheduler was used. The learning rate schedule was the same as in the prior work to allow a direct comparison in performance. The learning rate was initially set to 0.01 for the first 10 epochs, then set to 0.1 from epoch 11-100, and then cut by a factor of 10 at epochs 120 and 150. Table I presents our results alongside the real and complex valued networks. Our quaternion models outperform the real and complex networks on both datasets with a smaller parameter count. The quaternion models do take roughly 50% longer to train due to the computationally intense operations of quaternion batch normalization.
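
The learning rate schedule described above can be written as a small function. This is a sketch under stated assumptions: epochs are counted from 1, the function name is hypothetical, and because the text does not say what happens between epochs 101 and 119, the rate is assumed to stay at 0.1 until the first cut at epoch 120.

```python
def classification_learning_rate(epoch):
    """Learning rate for a 1-indexed epoch, following the schedule in the text.

    Assumption: the rate stays at 0.1 for epochs 101-119, which the text
    leaves unspecified.
    """
    if epoch <= 10:
        return 0.01   # warm-up for the first 10 epochs
    if epoch < 120:
        return 0.1    # main phase, epoch 11 onward
    if epoch < 150:
        return 0.01   # first cut by a factor of 10 at epoch 120
    return 0.001      # second cut by a factor of 10 at epoch 150


# Rates at a few representative epochs.
rates = [classification_learning_rate(e) for e in (1, 11, 120, 150)]
```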

IV-B Segmentation

For this experiment we used the same model as above, but reduced the number of residual blocks for memory reasons, given that the KITTI data consists of large color images of about $1200\times 375$ pixels. We only use one model for this experiment due to resource limitations. It is the same as the small model from the classification experiments, which has 2, 1, and 1 blocks at the three stages, but it does not perform any strided convolutions. It also does not have the global average pooling layer or the fully connected layer.

The last layer is a $1\times 1$ convolution with a sigmoid output, so we obtain a heatmap prediction the same size as the input. The training procedure is also as above, but the learning rate is scheduled differently. Here we begin at 0.01 for the first 10 epochs, then set it to 0.1 from epoch 11-50, and then cut it by a factor of 10 at epochs 100 and 150. Table II presents our results alongside the real and complex valued networks, where we used Intersection over Union (IOU) as the performance measure. The quaternion network outperformed the other two by a larger margin than in the classification tasks and, again, with a smaller parameter count.
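
For reference, the IOU measure for a binary road mask can be computed as in the following sketch. The function name, the 0.5 threshold applied to the sigmoid heatmap, and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details given in the text.

```python
def intersection_over_union(heatmap, target, threshold=0.5):
    """IOU between a thresholded heatmap and a binary label image.

    `heatmap` and `target` are equally sized nested lists (rows of pixels).
    The threshold and the empty-mask convention are assumptions.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(heatmap, target):
        for p, t in zip(pred_row, target_row):
            pred_on = p >= threshold
            target_on = bool(t)
            intersection += int(pred_on and target_on)
            union += int(pred_on or target_on)
    return intersection / union if union else 1.0


# Toy 2x2 example: one pixel agrees, two pixels disagree.
score = intersection_over_union([[0.9, 0.2], [0.7, 0.1]], [[1, 0], [0, 1]])
```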

V Conclusions

We have extended work on complex valued networks by exploring quaternion values. We presented the building blocks required to build and train deep quaternion networks and used them to test residual architectures on two common image classification benchmarks. We show that they have competitive performance, beating both the real and complex valued networks with fewer parameters. Future work will be needed to test quaternion networks on more segmentation datasets and on audio processing tasks.

VI Acknowledgment

We would like to thank James Dent of the University of Louisiana at Lafayette Physics Department for helpful discussions. We also thank Fugro for research time on this project.

VII Appendix

VII-B Whitening a Matrix

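Let X be an $n\times n$ matrix with mean $\boldsymbol{\mu}$, and let $\mbox{cov}(\textbf{X})=\mathbf{\Sigma}$ be its symmetric covariance matrix of the same size. Whitening a matrix linearly decorrelates the input dimensions, meaning that whitening transforms X into Z such that $\mbox{cov}(\textbf{Z})=\textbf{I}$, where I is the identity matrix. The matrix Z can be written as:

$$\textbf{Z}=\textbf{W}(\textbf{X}-\boldsymbol{\mu}),$$

where W is an $n\times n$ ‘whitening’ matrix. Since $\mbox{cov}(\textbf{Z})=\textbf{I}$ it follows that:

$$\mathrm{E}[\textbf{Z}\textbf{Z}^{T}]=\textbf{I}\;\Rightarrow\;\textbf{W}\,\mathrm{E}[(\textbf{X}-\boldsymbol{\mu})(\textbf{X}-\boldsymbol{\mu})^{T}]\,\textbf{W}^{T}=\textbf{I}\;\Rightarrow\;\textbf{W}\mathbf{\Sigma}\textbf{W}^{T}=\textbf{I}\;\Rightarrow\;\textbf{W}^{T}\textbf{W}=\mathbf{\Sigma}^{-1}.\tag{17}$$

From (17) it is clear that the Cholesky decomposition provides a suitable (but not unique) method of finding W.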

VII-C Cholesky Decomposition

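Cholesky decomposition is an efficient way to implement LU decomposition for symmetric positive-definite matrices, which allows us to find the square root. Consider $\textbf{A}\textbf{X}=\textbf{b}$, $\textbf{A}=[a_{ij}]_{n\times n}$, and $a_{ij}=a_{ji}$; then the Cholesky decomposition of A is given by $\textbf{A}=\textbf{L}\textbf{L}^{\prime}$ where L is the lower triangular matrix

$$\textbf{L}=\begin{bmatrix}l_{11} & 0 & \cdots & 0\\ l_{21} & l_{22} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ l_{n1} & l_{n2} & \cdots & l_{nn}\end{bmatrix}.$$

Let $l_{ki}$ be the $k^{th}$ row and $i^{th}$ column entry of L, then

$$l_{ki}=\begin{cases}0, & k<i\\ \sqrt{a_{ii}-\sum_{j=1}^{i-1}l_{ij}^{2}}, & k=i\\ \dfrac{1}{l_{ii}}\left(a_{ki}-\sum_{j=1}^{i-1}l_{kj}l_{ij}\right), & k>i\end{cases}$$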
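
The entries of L can be computed column by column: first the diagonal entry of a column, then the entries below it. The following pure-Python sketch illustrates this standard recurrence on a small symmetric positive-definite matrix; the function name `cholesky_by_columns` and the example matrix are chosen for illustration only.

```python
import math


def cholesky_by_columns(A):
    """Return lower-triangular L with A = L L', filling one column at a time.

    A must be symmetric positive definite; zero-based indices are used.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal entry: l_ii = sqrt(a_ii - sum_j l_ij^2) over j < i.
        L[i][i] = math.sqrt(A[i][i] - sum(L[i][j] ** 2 for j in range(i)))
        # Entries below the diagonal in column i (rows k > i).
        for k in range(i + 1, n):
            partial = sum(L[k][j] * L[i][j] for j in range(i))
            L[k][i] = (A[k][i] - partial) / L[i][i]
    return L


# A small symmetric positive-definite example.
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky_by_columns(A)
```

Multiplying the resulting L by its transpose reproduces A, and entries above the diagonal remain zero.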

VII-D 4 DOF Independent Normal Distribution

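Consider the four-dimensional vector $\textbf{Y}=(S,T,U,V)$ which has components that are normally distributed, centered at zero, and independent. Then $S$, $T$, $U$, and $V$ all have density functions

$$f_{S}(x;\sigma)=f_{T}(x;\sigma)=f_{U}(x;\sigma)=f_{V}(x;\sigma)=\frac{e^{-x^{2}/(2\sigma^{2})}}{\sqrt{2\pi\sigma^{2}}}.$$

Let X be the length of Y, which means $\textbf{X}=\sqrt{S^{2}+T^{2}+U^{2}+V^{2}}$. Then X has the cumulative distribution function

$$F_{X}(x;\sigma)=\iiiint_{H_{x}}f_{S}(\mu;\sigma)f_{T}(\nu;\sigma)f_{U}(\rho;\sigma)f_{V}(\gamma;\sigma)\,dA,$$

where $H_{x}$ is the four-dimensional ball

$$H_{x}=\left\{(\mu,\nu,\rho,\gamma):\sqrt{\mu^{2}+\nu^{2}+\rho^{2}+\gamma^{2}}<x\right\}.$$

We then can write the integral in polar representation

$$F_{X}(x;\sigma)=\frac{1}{4\pi^{2}\sigma^{4}}\int_{0}^{\pi}\int_{0}^{\pi}\int_{0}^{2\pi}\int_{0}^{x}r^{3}e^{-r^{2}/(2\sigma^{2})}\sin^{2}\theta\,\sin\phi\;dr\,d\psi\,d\phi\,d\theta=\frac{1}{2\sigma^{4}}\int_{0}^{x}r^{3}e^{-r^{2}/(2\sigma^{2})}\,dr.\tag{22}$$

The probability density function of X is the derivative of its cumulative distribution function, so we use the fundamental theorem of calculus on (22) to finally arrive at

$$f_{X}(x;\sigma)=\frac{d}{dx}F_{X}(x;\sigma)=\frac{1}{2\sigma^{4}}x^{3}e^{-x^{2}/(2\sigma^{2})}.$$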
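
The density this derivation arrives at is that of the length of a four-dimensional zero-mean normal vector, $f_{X}(x;\sigma)=x^{3}e^{-x^{2}/(2\sigma^{2})}/(2\sigma^{4})$ for $x\geq 0$. As a sanity check, the sketch below numerically confirms that it integrates to 1 and that its second moment is $4\sigma^{2}$, the value used in the weight initialization. The function names, the midpoint rule, and the truncation of the integral at $12\sigma$ are choices made for this illustration.

```python
import math


def four_dof_pdf(x, sigma):
    """Density of the length of a 4D vector of independent N(0, sigma^2) values."""
    if x < 0:
        return 0.0
    return x ** 3 * math.exp(-x * x / (2.0 * sigma * sigma)) / (2.0 * sigma ** 4)


def integrate(f, lower, upper, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (n + 0.5) * width) for n in range(steps))


sigma = 0.5
upper = 12.0 * sigma  # the tail beyond this point is negligible
total_mass = integrate(lambda x: four_dof_pdf(x, sigma), 0.0, upper)
second_moment = integrate(lambda x: x * x * four_dof_pdf(x, sigma), 0.0, upper)
# total_mass is close to 1 and second_moment is close to 4 * sigma**2.
```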