Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group

Mario Lezcano-Casado, David Martínez-Rubio

Introduction

Training deep neural networks presents many difficulties. One of the most important is the exploding and vanishing gradient problem, as first observed and studied in (Bengio et al., 1994). This problem arises from the ill-conditioning of the function defined by a neural network as the number of layers increase. This issue is particularly problematic in Recurrent Neural Networks (rnns). In rnns the eigenvalues of the gradient of the recurrent kernel explode or vanish exponentially fast with the number of time-steps whenever the recurrent kernel does not have unitary eigenvalues (Arjovsky et al., 2016). This behavior is the same as the one encountered when computing the powers of a matrix, and results in very slow convergence (vanishing gradient) or a lack of convergence (exploding gradient).

In the seminal paper (Arjovsky et al., 2016), they note that unitary matrices have properties that would solve the exploding and vanishing gradient problems. These matrices form a group called the unitary group and they have been studied extensively in the fields of Lie group theory and Riemannian geometry. Optimization methods over the unitary and orthogonal group have found rather fruitful applications in rnns in recent years (\cf Section 2).

In parallel to the work on unitary rnns, there has been an increasing interest for optimization over the orthogonal group and the Stiefel manifold in neural networks (Harandi & Fernando, 2016; Ozay & Okatani, 2016; Huang et al., 2017; Bansal et al., 2018). As shown in these papers, orthogonal constraints in linear and cnn layers can be rather beneficial for the generalization of the network as they act as a form of implicit regularization. The main problem encountered while using these methods in practice was that optimization with orthogonality constraints was neither simple nor computationally cheap. We aim to close that bridge.

In this paper we present a simple yet effective way to approach problems that present orthogonality or unitary constraints. We build on results from Riemannian geometry and Lie group theory to introduce a parametrization of these groups, together with theoretical guarantees for it.

This parametrization has several advantages, both theoretical and practical:

It can be used with general purpose optimizers.

The parametrization does not create additional minima or saddle points in the main parametrization region.

It is possible to use a structured initializer to take advantage of the structure of the eigenvalues of the orthogonal matrix.

Other approaches need to enforce hard orthogonality constraints, ours does not.

Most previous approaches fail to satisfy one or many of these points. The parametrization in (Helfrich et al., 2018) and (Maduranga et al., 2018) comply with most of these points but they suffer degeneracies that ours solves (\cf Remark in Section 4.2). We compare our architecture with other methods to optimize over $\operatorname{SO}\lparen n\rparen$ and $\operatorname{U}\lparen n\rparen$ in the remarks in Sections 3 and 4.

The matrix exponential maps skew-symmetric matrices to orthogonal matrices transforming an optimization problem with orthogonal constraints into an unconstrained one. We use Padé approximants and the scale-squaring trick to compute machine-precision approximations of the matrix exponential and its gradient. We can implement the parametrization with negligible overhead observing that it does not depend on the batch size.

Structure of the Paper

In Section 3, we introduce the parametrization and present the theoretical results that support the efficiency of the exponential parametrization. In Section 4, we explain the implementation details of the layer. Finally, in Section 5, we present the numerical experiments confirming the numerical advantages of this parametrization.

Related Work

There is a vast literature on optimization methods on Riemannian manifolds, and in particular for matrix manifolds, both in the deterministic and the stochastic setting. Most of the classical convergence results from the Euclidean setting have been adapted to the Riemannian one (Absil et al., 2009; Bonnabel, 2013; Boumal et al., 2016; Zhang et al., 2016; Sato et al., 2017). On the other hand, the problem of adapting popular optimization algorithms like rmsprop (Tieleman & Hinton, 2012), adam (Kingma & Ba, 2014) or adagrad (Duchi et al., 2011) is a topic of current research (Kumar Roy et al., 2018; Becigneul & Ganea, 2019).

Optimization over the Orthogonal and Unitary groups.

The first formal study of optimization methods on manifolds with orthogonal constraints (Stiefel manifolds) is found in the thesis (Smith, 1993). These ideas were later simplified in the seminal paper (Edelman et al., 1998), where they were generalized to Grassmannian manifolds and extended to get the formulation of the conjugate gradient algorithm and the Newton method for these manifolds. After that, optimization with orthogonal constraints has been a central topic of study in the optimization community. A rather in depth literature review of existing methods for optimization with orthogonality constraints can be found in (Jiang & Dai, 2015). When it comes to the unitary case, the algorithms used in practice are similar to those used in the real case, \cf (Manton, 2002; Abrudan et al., 2008).

Unitary rnns.

The idea of parametrizing the matrix that defines an RNN by a unitary matrix was first proposed in (Arjovsky et al., 2016). Their parametrization centers on a matrix-based fast Fourier transform-like (fft) approach. As pointed out in (Jing et al., 2017), this representation, although efficient in memory, does not span the whole space of unitary matrices, giving the model reduced expressiveness. This second paper solves this issue in the same way it is solved when computing the fft—using $\log(n)$ iterated butterfly operations. A different approach to perform this optimization was presented in (Wisdom et al., 2016; Vorontsov et al., 2017). Although not mentioned explicitly in either of the papers, this second approach consists of a retraction-based Riemannian gradient descent via the Cayley transform. The paper (Hyland & Rätsch, 2017) proposes to use the exponential map on the complex case, but they do not perform an analysis of the algorithm or provide a way to approximate the map nor the gradients. A third approach has been presented in (Mhammedi et al., 2017) via the use of Householder reflections. Finally, in (Helfrich et al., 2018) and the follow-up (Maduranga et al., 2018), a parametrization of the orthogonal group via the use of the Cayley transform is proposed. We will have a closer look at these methods and their properties in Sections 3 and 4.

Parametrization of Compact Lie Groups

For a much broader introduction to Riemannian geometry and Lie group theory see Appendix A. We will restrict our attention to the special orthogonal Note that we consider just matrices with determinant equal to one, since the full group of orthogonal matrices $\operatorname{O}\lparen n\rparen$ is not connected, and hence, not amenable to gradient descent algorithms. and unitary case, but the results in this section can be generalized to any connected compact matrix Lie group equipped with a bi-invariant metric. We prove the results for general connected compact matrix Lie groups in Appendix C.

We are interested in the study of parametrizations of the special orthogonal group

We call the tangent space at the identity element of the group the Lie algebra of the group. For the two groups of interest, their Lie algebras are given by

That is, the skew-symmetric and the skew-Hermitian matrices respectively. Note that these two spaces are isomorphic to a vector space. For example, for $\mathfrak{so}\lparen n\rparen$ , the isomorphism is given by

2 Parametrizing SO⁡(n)SO𝑛\operatorname{SO}\lparen n\rparen and U⁡(n)U𝑛\operatorname{U}\lparen n\rparen

In the theory of Lie groups there exists a tight connection between the structure of the Lie algebra and the geometry of the Lie group. One of the most important tools that is used to study one in terms of the other is the Lie exponential map. The Lie exponential map on matrix Lie groups with a bi-invariant metric is given by the exponential of matrices. If we denote the group by $G$ (which would be $\operatorname{SO}\lparen n\rparen$ or $\operatorname{U}\lparen n\rparen$ in this case) and its Lie algebra by $\mathfrak{g}$ , we have the mapping $\exp:\mathfrak{g}\to G$ defined as

This mapping is not surjective in general. On the other hand, there are particular families of Lie groups in which the exponential map is, in fact, surjective. Compact Lie groups are one of such families.

The Lie exponential map on a connected, compact Lie group is surjective.

We give a short self-contained proof of this classical result in Appendix C. We give an alternative, less abstract proof of this fact for the groups $\operatorname{SO}\lparen n\rparen$ and $\operatorname{U}\lparen n\rparen$ as a corollary of Proposition 3.2 in Appendix D. ∎

Both $\operatorname{SO}\lparen n\rparen$ and $\operatorname{U}\lparen n\rparen$ are compact and connected, so this result applies to them. As such, the exponential of matrices gives a complete parametrization of these groups.

3 From Riemannian to Euclidean optimization

In this section we describe some properties of the exponential parametrization which make it a sound choice for optimization with orthogonal constraints in neural networks.

Fix $G$ to be $\operatorname{SO}\lparen n\rparen$ or $\operatorname{U}\lparen n\rparen$ equipped with the metricNote that in the real case we have that $A^{\ast}=A^{\intercal}$ . $\langle X,Y\rangle=\operatorname{tr}\lparen X^{\ast}Y\rparen$ and let $\mathfrak{g}$ be its Lie algebra (the space of skew-symmetric or skew-Hermitian matrices). The exponential parametrization satisfies the following properties.

The exponential parametrization allows us to pullback an optimization problem from the group $G$ back to the Euclidean space. If we have a problem

We noted in Section 3.1 that $\mathfrak{g}$ is isomorphic to a Euclidean vector space, and as such we can use regular gradient descent optimizers like adam or adagrad to approximate a solution to problem (2).

A rather natural question to ask is whether using gradient-based methods to approximate the solution of problem (2) would give a sensible solution to problem (1), given that precomposing with the exponential map might change the geometry of the problem. If the parametrization is, for example not locally unique, this might degrade the gradient flow and affect the performance of the gradient descent algorithm. In this section we will show theoretically that this parametrization has rather desirable properties for a parametrization of a manifold. We will confirm that these properties have a positive effect on the convergence of the gradient descent algorithms when compared with other parametrizations when applied to real problems in Section 5.

It does not change the minimization problem.

It is clear that a minimizer $\widehat{B}$ for problem (1) and a minimizer $\widehat{A}$ for problem (2) will be related by the equation $\widehat{B}=\exp(\widehat{A})$ , since the exponential map is surjective, so if we find a solution to the second problem we will have a solution to the first one.

It acts as a change of metric on the group.

If the parametrization did not induce a change of metric on the manifold it could mean that it would induce saddle points, which would potentially slow down the convergence of the optimization algorithm.

A map $\phi\colon\mathcal{M}\to\mathcal{N}$ with $\mathcal{N}$ a Riemannian manifold induces a metric on a differentiable manifold $\mathcal{M}$ whenever it is an immersion, that is, its differential is injective. The Lie exponential is not just an immersion, it is bi-analytic on an open neighborhood around the origin. The image of this neighborhood is sufficiently large to cover almost all the Lie group.

Let $G$ be $\operatorname{SO}\lparen n\rparen$ or $\operatorname{U}\lparen n\rparen$ . The exponential map is analytic, invertible, with analytic inverse on a bounded open neigborhood $V$ of the origin and $\exp(V)$ covers almost all $G$ in the sense that the whole group lies in the closure of $\exp(V)$ .

This is the map that induces the new metric on $G$ , through the pushforward of the canonical metric from the Lie algebra into the Lie group. As such, the optimization process using our parametrization can be seen as Riemannian gradient descent using this new metric, and all the existent results developed for optimization over manifolds apply to this setting.

We saw empirically that whenever the initialization of the skew-symmetric matrix starts in $V$ , the optimization path throughout all the training epochs does not leave $V$ . For this reason, in practice the exponential parametrization behaves as a change of metric on the Lie group.

The induced metric is different to the classic one.

The standard first order optimization technique to solve problem (1) is given by Riemannian gradient descent (Absil et al., 2009). In the Riemannian setting, we have the Riemannian exponential map $\exp_{B}$ which maps lines that pass through the origin on the tangent space $T_{B}G$ to geodesics on $G$ that pass through $B$ . In the special orthogonal or unitary case, when we choose the metric induced by the canonical metric on the ambient space, for a function defined on the ambient space, this translates to the update rule

for a learning rate $\eta>0$ , where $\exp$ is the exponential of matrices and $\operatorname{grad}f(B)$ denotes the gradient of the function restricted to $G$ . We deduce this formula in Example C.1.

Computing the Riemannian exponential map exactly is computationally expensive in many practical situations. For this reason, approximations are in order. Retractions are of particular interest.

A retraction $r$ for a manifold $\mathcal{M}$ is defined as a family of functions $r_{x}\colon T_{x}\mathcal{M}\to\mathcal{M}$ for every $x\in\mathcal{M}$ such that

In other words, retractions are a first order approximation of the Riemannian exponential map. A study of the convergence properties of first and second-order optimization algorithms when using retractions can be found in (Boumal et al., 2016). In the case of $G$ , we have that a way to form retractions is to choose a function $\phi\colon\mathfrak{g}\to G$ such that it is a first order approximation of the exponential of matrices and its image lies in $G$ . Then, the update rule is given by

For the special orthogonal and the unitary group, one such function is the Cayley map

This justifies theoretically the optimization methods used in (Wisdom et al., 2016; Vorontsov et al., 2017) and extends their work, given that all their architectures can still be applied with different retractions for these manifolds. In Section 4.2 we give examples of more involved retractions, and in Section 4.3 we explain why it is computationally cheap to use machine-accuracy approximants to compute the exponential map both in our approach and in the Riemannian gradient descent approach. Examples of other retractions and a deeper treatment of these objects can be found in Appendix B.

The update rule for the exponential parametrization induces a retraction-like map for $A\in\mathfrak{g}$

where the gradient is the gradient with respect to the Euclidean metric, that is, the regular gradient, given that $f\circ\exp$ is defined on a Euclidean space. A natural question that arises is whether this new update rule defines a retraction. It turns out that this map is not a retraction for $\operatorname{SO}\lparen n\rparen$ or $\operatorname{U}\lparen n\rparen$ .

The step-update map induced by the exponential parametrization is not a retraction for $\operatorname{SO}\lparen n\rparen$ if $n>2$ nor for $\operatorname{U}\lparen n\rparen$ if $n>1$ .

It is a corollary of Theorem C.12, where we give necessary and sufficient conditions for this map to be a retraction when defined on a compact, connected matrix Lie group.

Numerical Implementation

As an application of this framework we show how to model an orthogonal (or unitary) recurrent neural network with it, that is, an rnn whose recurrent matrix is orthogonal (or unitary). We also show how to implement numerically the ideas of the last section.

2 Approximating the exponential of matrices

There is a myriad of methods to approximate the exponential of a matrix (Moler & Van Loan, 2003). Riemannian gradient descent over $\operatorname{SO}\lparen n\rparen$ requires that the result of the approximation is orthogonal. If not, the error would accumulate after each step making the resulting matrix deviate from the orthogonality constraint exponentially fast in the number of steps. On the other hand, the approximation of the exponential in our parametrization does not require orthogonality. This allows many other approximations of the exponential function. The requirement is removed because the orthogonal matrix is implicitly represented as an exponential of a skew-symmetric matrix. The loss of orthogonality in Riemannian gradient descent is due to storing an orthogonal matrix and updating it directly.

Padé approximants are rational approximations of the form $\exp(A)\approx p_{n}(A)q_{n}(A)^{-1}$ for polynomials $p_{n},q_{n}$ of degree $n$ . A Padé approximant of degree $n$ agrees with the Taylor expansion of the exponential to degree $2n$ . The Cayley transform is the Padé approximant of degree $1$ . These methods and their implementations are described in detail in (Higham, 2009).

Scale-squaring trick.

Given that the Cayley transform is a degree $1$ Padé approximant of the exponential, if we choose this approximant without the scale-squaring trick we essentially recover the parametrization proposed in (Helfrich et al., 2018). The Cayley transform suffers from the fact that, if the optimum has $-1$ as an eigenvalue, the weights of the corresponding skew-symmetric matrix will tend to infinity. The parametrization in (Helfrich et al., 2018) is the Cayley transform multiplied by a diagonal matrix $D$ , but the parametrization still has the same problem, it just moves it to a different eigenvalue. Proposition 3.2 assures that the exponential parametrization does not suffer from this problem.

The follow-up work (Maduranga et al., 2018) mitigates this problem learning the diagonal of $D$ as well, but by doing so it loses the local unicity of the parametrization. Proposition 3.2 assures that the exponential parametrization is not only locally unique, but also differentiable with differentiable inverse, thus inducing a metric.

It is straightforward to show that a degree $n$ Padé approximant combined with the scaling-squaring trick also maps the skew-symmetric matrices to the special orthogonal matrices. This constitutes a much more precise retraction than the Cayley map at almost no extra computational cost. This observation can be used to improve the precision of the method proposed in (Wisdom et al., 2016; Vorontsov et al., 2017).

Exact approximation.

Combining the methods above we can get an efficient approximation of the exponential to machine-precision. The best one known to the authors is based on the paper (Al-Mohy & Higham, 2009b). It accounts for an efficient use of the scaling-squaring trick and a Padé approximant. This is the algorithm that we use on the experiments section to approximate the exponential.

Exact gradients.

A problem often encountered in practice is that of biased gradients. Even though an approximation might be good, it can significantly bias the gradient. An example of this would be trying to approximate the function $f\equiv 0$ on $ $by the functions$ f_{n}(x)=\frac{\sin(2\pi nx)}{n} $. Even though$ f_{n}\to f$ uniformly, their derivatives do not converge to zero. This problem is rather common when using involved parametrizations, for example, those coming from Chebyshev polynomials. On the other hand, the gradient can be implemented separately using a machine-precision formula.

It follows from the discussion in Example C.1 and Proposition C.11 in the supplementary material. ∎

3 Parametrizations are computationally cheap.

4 Intialization

For the initialization of the layer with a matrix $A_{0}\in\operatorname{Skew}\lparen p\rparen$ , we drew ideas from both (Henaff et al., 2016) and (Helfrich et al., 2018). Both initializations sample blocks of the form

for $s_{i}$ i.i.d. and then form $A_{0}$ as a block-diagonal matrix with these blocks.

The Henaff initialization consists of sampling $s_{i}\sim\operatorname{\mathcal{U}}[-\pi,\pi]$ . This defines a block-diagonal orthogonal matrix $e^{A}$ with uniformly distributed blocks on the corresponding torus of block-diagonal $2\times 2$ rotations. We sometimes found that the sampling presented in (Helfrich et al., 2018) performed better. This initialization, which we call Cayley, accounts for sampling $u_{i}\sim\operatorname{\mathcal{U}}[0,\frac{\pi}{2}]$ and then setting $s_{i}=-\sqrt{\frac{1-\cos(u_{i})}{1+\cos(u_{i})}}$ , thus biasing the eigenvalues towards .

We chose as the initial vector $h_{0}=0$ for simplicity, as we did not observe any empirical improvement when using the initialization given in (Arjovsky et al., 2016).

Experiments

In this section we compare the performance of our parametrization for orthogonal rnns with the following approaches:

(Hochreiter & Schmidhuber, 1997) • Unitary rnn (urnn). (Arjovsky et al., 2016) • Efficient Unitary rnn (eurnn). (Jing et al., 2017) • Cayley Parametrization (scornn). (Helfrich et al., 2018) • Riemannian Gradient Descent (rgd). (Wisdom et al., 2016) Figure 1: Cross entropy of the different algorithms in the copying problem for $L=1000$ (left) and $L=2000$ (right). We use three tasks that have become standard to measure the performance of rnns and their ability to deal with long-term dependencies. These are the copying memory task, the pixel-permuted mnist task, and the speech prediction on the timit dataset (Arjovsky et al., 2016; Wisdom et al., 2016; Henaff et al., 2016; Mhammedi et al., 2017; Helfrich et al., 2018). In Appendix E, we enumerate the hyperparameters used for the experiments. The sizes of the hidden layer were chosen to match the number of learnable parameters of the other architectures. Remark. We found empirically that having a learning rate for the orthogonal parameters that is $10$ times larger than that of the non-orthogonal parameters yields a good performance in practice.

For the other experiments, we executed the code that the other authors provided with the best hyperparameters that they reported and a batch of $128$ . The results for eurnn are those reported in (Jing et al., 2017), and for rgd and urnn are those reported in (Helfrich et al., 2018).

The code with the exact configuration and seeds to replicate these results, and a plug-and-play implementation of exprnn and the exponential framework can be found in https://github.com/Lezcano/expRNN.

The copying memory task was first proposed in (Hochreiter & Schmidhuber, 1997). The task can be defined as follows. Let $\mathcal{A}=\{a_{k}\}_{k=1}^{N}$ be an alphabet and let , be two symbols not contained in $\mathcal{A}$ . For a sequence length of $K$ and a spacing of length $L$ , the input sequence would be $K$ ordered characters $\lparen b_{k}\rparen_{k=1}^{K}$ sampled i.i.d. uniformly at random from $\mathcal{A}$ , followed by $L$ repetitions of the character , the character and finally $K-1$ repetitions of the character again. The output for this sequence would be $K+L$ times the character and then the sequence of characters $\lparen b_{k}\rparen_{k=1}^{K}$ . In other words, the system has to recall the initial $K$ characters and reproduce them after detecting the input of the character , which appears $L$ time-steps after the end of the input characters. For example, for $N=4$ , $K=5$ , $L=10$ , if we represent with a dash and with a colon, and the alphabet $\mathcal{A}=\{1,\dots,4\}$ , the following sequences could be an element of the dataset:

The loss function for this task is the cross entropy. The standard baseline for this task is the output of $K+L$ symbols, followed by the remaining $K$ symbols being output at random. This strategy yields a cross entropy of $K\log\lparen N\rparen/\lparen L+2K\rparen$ .

We observe that the training of scornn is unstable, which is probably due to the degeneracies explained in the remark in Section 4. In the follow-up paper (Maduranga et al., 2018), scurnn presents the same instabilities as its predecessor. As explained in Section 4, exprnn does not suffer of this, and can be observed in our experiments as a smoother convergence. In the more difficult problem, $L=2000$ , exprnn is the only architecture that is able to fully converge to the correct answer.

2 Pixel-by-Pixel mnist

In this task we use the mnist dataset of hand-written numbers (LeCun & Cortes, 2010) of images of size $28\times 28$ , only this time the images are flattened and are processed as an array of $784$ pixels, which is treated as a stream that is fed to the rnn, as described in (Le et al., 2015). In the unpermuted task, the stream is processed in a row-by-row fashion, while in the permuted task, a random permutation of the $784$ elements is chosen at the beginning of the experiment, and all the pixels of all the images in the experiment are permuted according to this permutation. The final output of the rnn is processed as the encoding of the number and used to solve the corresponding classification task.

In this experiment we observed that exprnn is able to saturate the capacity of the orthogonal rnn model for this task much faster than any other parametrization, as per Table 1. We conjecture that coupling the exponential parametrization with an lstm cell or a gru cell would yield a superior architecture. We leave this for future research.

3 timit Speech Dataset

We performed speech prediction on audio data with our model. We used the timit speech dataset (S Garofolo et al., 1992) which is a collection of real-world speech recordings. The task accounts for predicting the log-magnitude of incoming frames of a short-time Fourier transform (stft) as it was first proposed in (Wisdom et al., 2016).

We use the separation in train / test proposed in the original paper, having $3640$ utterances for the training set, a validation set of size $192$ , and a test set of size $400$ . The validation / test division and the whole preprocessing of the dataset was done according to (Wisdom et al., 2016). The preprocessing goes as follows: The data is sampled at $8$ kHz and then cut into time frames of the same size. These frames are then transformed into the log-magnitude Fourier space and finally, they are normalized according to a per-training set, test set, and validation set basis.

The results for this experiment are shown in Table 2. Again, the exponential parametrization beats—by a large margin—other methods of parametrization over the orthogonal group, and also the lstm architecture. The results in Table 2 are those reported in (Helfrich et al., 2018).

As a side note, we must say that the results in this experiment should be interpreted under the following fact: We had access to two of the implementations for the tests for the other architectures regarding this experiment, and neither of them correctly handled sequences with different lengths present in this experiment. We suspect that the other implementations followed a similar approach, given that the results that they get are of the same order. In particular, the implementation released by Wisdom, which is the only publicly available implementation of this experiment, divides by a larger number than it should when computing the average mse of a batch, hence reporting a lower mse than the correct one. Even in this unfavorable scenario, our parametrization is able to get results that are twice as good—the mse loss function is a quadratic function—as those from the other architectures.

Conclusion and Future Work

In this paper we have presented three main ideas. First, a simple approach based on classic Lie group theory to perform optimization over compact Lie groups, in particular $\operatorname{SO}\lparen n\rparen$ and $\operatorname{U}\lparen n\rparen$ , proving its soundness and providing empirical evidence of its superior performance. Second, an implementation trick that allows for the implementation of arbitrary parametrizations at a negligible runtime cost. Finally, we sketched how to improve some existing methods to perform optimization on Lie groups using Riemannian gradient descent. Any of these three ideas is of independent interest and could have more applications within neural networks.

The investigation of how to couple these ideas with the lstm architecture to improve its performance is left for future work.

Additionally, it could be of interest to see how orthogonal constraints help with learning in deep feed forward networks. In order to make this last point formal, one would have to generalize the results presented here to homogeneous Riemannian manifolds, like the Stiefel manifold.

Acknowledgements

We would like to thank the help of Prof. Raphael Hauser and Jaime Mendizabal for the useful conversations, Daniel Feinstein for the proofreading, Prof. Terry Lyons for the computing power, and Kyle Helfrich for helping us setting up the experiments.

The work of MLC was supported by the Oxford-James Martin Graduate Scholarship and the “la Caixa” Banking Foundation (LCF/BQ/EU17/11590067). The work of DMR was supported by EP/N509711/1 from the EPSRC MPLS division, grant No 2053152.

References

Appendix A Riemannian Geometry and Lie Groups

In this section we aim to give a short summary of the basics of classical Riemannian geometry and Lie group theory needed for our proofs in following sections. The standard reference for classical Riemannian geometry is do Carmo’s book (do Carmo, 1992). An elementary introduction to Lie group theory with an emphasis on concrete examples from matrix theory can be found in (Hall, 2015).

An affine connection $\nabla$ on a smooth manifold is a bilinear application that maps two vector fields $X,Y$ to a new one $\nabla_{X}Y$ such that it is linear in $X$ , and linear and Leibnitz in $Y$ .

We say that a connection is compatible with the metric if for any two parallel vector fields $X,Y$ along $\gamma$ , their scalar product is constant. In other words, the connection preserves the angle between parallel vector fields. We say that a connection is torsion-free if $\nabla_{X}Y-\nabla_{Y}X=[X,Y]:=XY-YX$ . In a Riemannian manifold, there exists a unique affine connection such that it is compatible with the metric and that is also torsion-free. We call this distinguished connection the Levi-Civita connection.

A geodesic is defined as a curve such that its tangent vectors are covariantly constant along itself, $\nabla_{\gamma^{\prime}}\gamma^{\prime}=0$ . It is not true in general that given two points in a manifold there exists a geodesic that connects them. However, the Hopf-Rinow theorem states that this is indeed the case if the manifold is connected and complete as a metric space. The manifolds that we will consider are all connected and complete.

At every point $p$ in our manifold we can define the Riemannian exponential map $\exp_{p}\colon T_{p}\mathcal{M}\to\mathcal{M}$ , which maps a vector $v$ to $\gamma(1)$ where $\gamma$ is the geodesic such that $\gamma(0)=p$ , $\gamma^{\prime}(0)=v$ . In a complete manifold, another formulation of the Hopf-Rinow theorem says that the exponential map is defined on the whole tangent space for every $p\in\mathcal{M}$ . We also have that the Riemannian exponential map maps diffeomorphically a neighborhood around zero on the tangent space to a neighborhood of the point on which it is defined.

A map between two Riemannian manifolds is called a (local) isometry if it is a (local) diffeomorphism and its differential respects the metric.

A.2 Lie groups

A Lie group is a smooth manifold equipped with smooth group multiplication and inverse. Examples of Lie groups are the Euclidean space equipped with its additive group structure and the general linear group $\operatorname{GL}$ of a finite dimensional vector space given by the invertible linear endomorphisms of the space equipped with the composition of morphisms. We say that a Lie group is a matrix Lie group if it is a closed subgroup of some finite-dimensional general linear group.

Appendix B Retractions

We take a deeper look into the concept of a retraction, which helps understanding the correctness of the approach used to optimize on $\operatorname{SO}\lparen n\rparen$ and $\operatorname{U}\lparen n\rparen$ presented in (Wisdom et al., 2016; Vorontsov et al., 2017).

The concept of a retraction is a relaxation of that of the Riemannian exponential map.

In other words, when $\mathcal{M}$ is a Riemannian manifold, $r$ is a first order approximation of the Riemannian exponential map.

It is clear that the exponential map is a retraction. For manifolds embedded in the Euclidean space with the metric induced by that of the Euclidean space, the following proposition gives us a simple way to construct a rather useful family of retractions—those used in projected gradient descent.

From $\pi$ being a surjective projection we have that $\pi(x)=x$ for every $x\in\mathcal{M}$ , which implies the first condition of the definition of retraction.

This proposition lets us see projected Riemannian gradient descent as an specific case of Riemannian gradient descent with a specific retraction. A corollary of this proposition that allows $r$ to be defined just in those vectors of the form $x+v$ with $(x,v)=T\mathcal{M}$ lets us construct specific examples of retractions:

Recall that for $A\in\operatorname{SO}\lparen n\rparen$ ,

then, for an element of $T\operatorname{SO}\lparen n\rparen$ we can define the map given by Proposition B.2. In this case, the projection $\pi(X)$ for a matrix with singular value decomposition $X=U\Sigma V^{\intercal}$ is $\pi(X)=UV^{\intercal}$ .

The two examples above are examples of orthogonal projections. The manifolds being considered are embedded into a Euclidean space and they inherit its metric. The projections here are orthogonal projections on the ambient space. On the other hand, Proposition B.2 does not require the projections to be orthogonal.

Different examples of retractions can be found in (Absil et al., 2009), Example $4.1.2$ .

Appendix C Comparing Riemannian gradient descent and the exponential parametrization

The proofs in this section are rather technical and general so that they apply to a wide variety of manifolds such as $\operatorname{SO}\lparen n\rparen$ , $\operatorname{U}\lparen n\rparen$ , or the symplectic group. Even though we do not explore applications of optimizing over other compact matrix Lie groups, we state the results in full generality. Having this in mind, we will first motivate this section with the concrete example that we study in the applications of this paper: $\operatorname{SO}\lparen n\rparen$ .

Riemannian gradient descent works by following the geodesic defined by the direction $-\operatorname{grad}f(B)$ at the point $B$ . In the words of the Riemannian exponential map at $B$ , if we have a learning rate $\eta>0$ , the update rule will be given by

The tangent space to $\operatorname{SO}\lparen n\rparen$ at a matrix $B$ is

and it is easy to check that the orthogonal projection with respect to the canonical metric onto this space is given by

Since the gradient on the manifold is just the tangent component of the gradient on the ambient space, we have that

Putting everything together, the Riemannian gradient descent update rule for $\operatorname{SO}\lparen n\rparen$ is given by

The other update rule that we have is the one given by the exponential parametrization

This is nothing but the gradient descent update rule applied to the problem

Both of these rules follow geodesic flows for two metrics whenever $A$ is in a neighborhood of the identity on which $\exp$ is a diffeomorphism (\cfAppendix D). A natural question that arises is whether these metrics are the same. A less restrictive question would be whether this second optimization procedure defines a retraction or whether their gradient flow is completely different.

In the sequel, we will see that for $\operatorname{SO}\lparen n\rparen$ these two optimization methods give raise to two rather different metrics. We will explicitly compute the quantity $\nabla\lparen f\circ\exp\rparen(A)$ , and we will give necessary and sufficient conditions equivalent under which these two optimization methods agree.

In this section we expose the theoretical part of the paper. The first part of this section is classic and can be found, for example, in Milnor (Milnor, 1976). We present it here for completeness. The results Proposition C.11 and Theorem C.12 are novel.

Throughout this section the operator $\lparen-\rparen^{\ast}$ will have two different meanings. It can be either the pullback of a form along a function or the adjoint of a linear operator on a vector space with an inner product. Although the two can be distinguished in many situations, we will explicitly mention to which one we are referring whenever it may not be clear from the context. Note that when we are on a matrix Lie group equipped with the product $\langle X,Y\rangle=\operatorname{tr}\lparen X^{\ast}Y\rparen$ , the adjoint of a linear operator is exactly its conjugate transpose, hence the notation.

We start by recalling the definition of our object of study.

A Riemannian metric on a Lie group $G$ is said to be left (resp. right) invariant if it makes left (resp. right) translations into isometries. Explicitly, it is so if for every $g\in G$ , and the metric $\alpha$ we have that $\alpha_{g}=L^{\ast}_{g^{-1}}\alpha_{e}$ (resp. $\alpha_{g}=R^{\ast}_{g^{-1}}\alpha_{e}$ ).

A bi-invariant metric is a metric that is both left and right-invariant.

We can construct a bi-invariant metric on a Lie group by using the averaging trick.

A compact Lie group $G$ admits a bi-invariant metric.

Let $n$ be the dimension of $G$ and let $\mu_{e}$ be a non-zero $n$ -form at $\mathfrak{g}$ . This form is unique up to a multiplicative constant. We can then extend it to the whole $G$ by pulling it back along $R_{g}$ defining $\mu_{g}:=R^{\ast}_{g^{-1}}\mu_{e}$ . This makes it into a right-invariant $n$ -form on the manifold, which we call the right Haar measure.

Let $\lparen\cdot,\cdot\rparen$ be an inner product on $\mathfrak{g}$ . We can turn this inner product into an $\operatorname{Ad}$ -invariant inner product on $\mathfrak{g}$ by averaging it over the elements of the group using the right Haar measure

Note that this integral is well defined since $G$ is compact. The $\operatorname{Ad}$ -invariance follows from the right-invariance of $\mu$

Finally, we can extend this product to the whole group by pulling back the inner product along $L_{g}$ , that is, if we denote the metric by $\alpha$ , $\alpha_{g}=L^{\ast}_{g^{-1}}\alpha_{e}$ . This automatically makes it into a left-invariant metric. But since it is $\operatorname{Ad}$ -invariant at the identity, we have that for every $g,h\in G$

and the metric is also right-invariant, finishing the proof. ∎

If the group is abelian, the construction above is still valid without the need of the averaging trick, since $\operatorname{Ad}$ is the identity map, so every inner product is automatically $\operatorname{Ad}$ -invariant.

It turns out that these examples and their products exhaust all the Lie groups that admit a bi-invariant metric. We include this result for completeness, even though we will not use it.

A Lie group admits a bi-invariant metric if and only if it is isomorphic to $G\times H$ with $G$ compact and $H$ abelian.

Lie groups, when equipped with a bi-invariant metric are rather amenable from the Riemannian geometry perspective. This is because it is possible to reduce many computations on them to matrix algebra, rather than the usual systems of differential equations that one encounters when dealing with arbitrary Riemannian manifolds.

The following proposition will come in handy later.

If an inner product on $\mathfrak{g}$ is $\operatorname{Ad}$ -invariant then

In other words, the adjoint of the map $\operatorname{ad}_{X}$ with respect to the inner product is $-\operatorname{ad}_{X}$ . We say that $\operatorname{ad}$ is skew-adjoint and we write $\operatorname{ad}^{\ast}_{X}=-\operatorname{ad}_{X}$ .

With this result in hand, we can prove a rather useful relation between the geometry of the Lie group and its algebraic structure.

Let $G$ be a Lie group equipped with a bi-invariant metric. If $X,Y$ are left-invariant vector fields, we have that their Levi-Civita connection and their sectional curvature are given by

The sectional curvature formula holds whenever $X$ and $Y$ are orthonormal.

For left-invariant vector fields $X,Y,Z$ , the Koszul formula gives

The three first terms on the right vanish, since the angle between invariant vector fields is constant. Reordering the last three terms, using Lemma C.5, the fact that the Lie bracket is antisymmetric, and since invariant vector fields form a basis of the Lie algebra, the formula for the connection follows. Now, the curvature tensor is given by

So the sectional curvature for $X,Y$ orthonormal is given by

On a matrix Lie group equipped with a metric we have three different notions of exponential maps, namely the Lie exponential map, the Riemannian exponential map and the exponential of matrices. We will now show that if we consider the Riemannian exponential map at the identity element, these three concepts agree whenever the metric is bi-invariant.

Let $G$ be a Lie group equipped with a bi-invariant metric. Then, the Riemannian exponential at the identity $\exp_{e}$ and the Lie exponential $\exp$ agree.

Fix a vector $X_{e}\in\mathfrak{g}$ and consider the curve $\gamma(t)=\exp(tX_{e})$ . This curve is the integral curve of the invariant vector field defined by $X_{e}$ , this is $\gamma^{\prime}(t)=X(\gamma(t))$ . For this reason, by Proposition C.6

so $\exp(tX_{e})$ is a geodesic and the result readily follows. ∎

The matrix exponential $\exp_{tX}$ can be expressed as the solution of the matrix differential equation

Finally, all these equivalences give a short proof of the fact that the Lie exponential map is surjective on a connected Lie group with a bi-invariant metric.

Let $G$ be a connected Lie group equipped with a bi-invariant metric. The Lie exponential is surjective.

Now that we have all the necessary tools, we shall return to the problem of studying the metric induced by the exponential parametrization.

As we have seen, the problem that we are interested in solving is

where $G$ is a matrix Lie group equipped with an bi-invariant metric. The exponential parametrization maps this problem back to the Lie algebra

Since $\mathfrak{g}$ is a vector space, putting a basis on it we have that we can use all the classical toolbox developed for Euclidean spaces to approach a solution for this problem. In particular, in the context of neural networks, we are interested in studying first-order optimization methods that approach a solution to this problem, in particular, gradient descent methods. The gradient descent update step for this problem with learning rate $\eta>0$ is given by

where the gradient is defined with respect to the metric, that is, it is the vector such that

To study this optimization method, we first have to make sense of the gradient $\nabla\lparen f\circ\exp\rparen(X)$ . To do so, we will make use of the differential of the exponential map.

The differential of the exponential map on a matrix Lie group is given by the formula

An analogous formula still holds in the general case, but the proof is more delicate. The powers of the adjoint representation are to be thought as the composition of endomorphisms on $\mathfrak{g}$ . For this reason, this formula can be also expressed as

Yet another way of looking at this expression is by defining the function

In this case, the fraction that defines $\phi$ is just a formal expression to refer to the formal series defined in Proposition C.10.

From this we can compute the gradient of $f\circ\exp$ .

Let $U\in\mathfrak{g}$ . By the chain rule, we have

In terms of the gradient of $f$ with respect to the metric this is equivalent to

Now we just have to compute the adjoint of the differential of the exponential function. This is now simple since

where the second equality follows from the product being left-invariant, the third one from $\phi$ being analytic and the last one from Lemma C.5. ∎

Now we can explicitly define the update rule for the exponential parametrization

We can then study the gradient flow induced by the exponential parametrization by means of $\widehat{r}$ . If $\widehat{r}$ were a retraction, then the flow induced by the exponential parametrization would have similar properties as that of Riemannian gradient descent, as shown in (Boumal et al., 2016). It turns out that the exponential parametrization induces a different flow.

Let $G$ be a connected matrix Lie group equipped with a bi-invariant metric. The function $\widehat{r}$ is a retraction if and only if $G$ is abelian.

It is clear that $\widehat{r}_{g}(0)=g$ for every $g\in G$ . Let $A\in\mathfrak{g}$ and $B=e^{A}$ and let $U\in T_{B}G$ . By the chain rule we have that

The map $\widehat{r}$ is a retraction if and only if

holds for every $U\in T_{B}G$ . This is equivalent to having

for every $H\in T_{B}G$ . Taking adjoints and since the metric is left-invariant, using the formula for the adjoint of the differential of the exponential map computed in Proposition C.11, or equivalently

In other words, $\widehat{r}$ is a retraction if and only if the Lie exponential map is a local isometry.

Now, the Lie exponential maps $\mathfrak{g}$ into $G$ , but $\mathfrak{g}$ equipped with its metric is flat, so it has constant sectional curvature zero. On the other hand, the sectional curvature of $G$ is given by $\kappa(X,Y)=\frac{1}{4}\lVert[X,Y]\rVert^{2}$ . Recall that a Lie group is abelian if and only if its Lie bracket is zero.

Conversely, if it is an isometry, it preserves the sectional curvature, so the Lie bracket has to be identically zero, hence the group is Abelian. ∎

In the abelian case, we do not only have that $\widehat{r}$ is a retraction, but also that the update rule for the exponential parametrization agrees with that of Riemannian gradient descent. Recall that the Riemannian gradient descent rule for a gradient $U$ and a step-size $\eta$ is given by

Appendix D Maximal Normal Neighborhood of the Identity

Let $V\subseteq T_{p}\mathcal{M}$ be a neighborhood of such that the Riemannian exponential map $\exp_{p}$ is a diffeomorphism. Then we say that $\exp_{p}V$ is a normal neighborhood of $p$ .

Using this function, we can factorize the differential of the exponential function on a matrix Lie group as

Let us now compute the maximal normal neighborhood of the identity. This result is classic, but the proof here is a simplification of the classical one using an approximation argument.

Let $G$ be a compact and connected matrix Lie group. The exponential function is analytic, with analytic inverse on a bounded open neighborhood of the origin given by

Given that $L_{e^{A}}$ is a diffemorphism, we are interested in studying when the matrix defined by the function $\phi\lparen\operatorname{ad}_{A}\rparen$ stops being full-rank and when is it injective.

Let us look at the particular cases that we are interested in. If we set $G=\operatorname{SO}\lparen n\rparen$ , we have that its Lie algebra are the skew-symmetric matrices. Skew-symmetric matrices have purely imaginary eigenvalues. Furthermore, since they are real matrices, their eigenvalues come in conjugate pairs. As such, we have that the exponential map is singular on every matrix in the boundary of the set $U$ defined in Theorem D.2.

Special orthogonal matrices are those matrices which are similar to block-diagonal matrices with diagonal blocks of the form

for $\theta\in(-\pi,\pi]$ . On $\operatorname{SO}\lparen 2n+1\rparen$ , there is an extra block with a single $1$ .

Similarly, skew-symmetric matrices are those matrices which are similar to block-diagonal matrices with diagonal blocks of the form

On $\mathfrak{so}\lparen 2n+1\rparen$ there is an extra block with a single zero.

This outlines an elementary proof of the fact that the Lie exponential is surjective on $\operatorname{SO}\lparen n\rparen$ and $\operatorname{U}\lparen n\rparen$ .

In both cases this shows that the boundary of $U$ has measure zero and that $f(\overline{U})=G$ .

The reader familiar with Lie group theory will have noticed that this proof is exactly the standard one for the surjectivity of the exponential map using the Torus theorem, where one proves that all the maximal tori in a compact Lie group are conjugated and that every element of the group lies in some maximal torus, arriving then to the same conclusion but in much more generality.

Appendix E Hyperparameters for the Experiments

The batch size across all the experiments was $128$ . The learning rates for the orthogonal parameters are $10$ times less those for the non-orthogonal parameters. We fixed the seed of both Numpy and Pytorch to be $5544$ for all the experiments for reproducibility. This is the same seed that was used in the experiments in (Helfrich et al., 2018). In Table 3 we refer to the optimizer and learning rate for the non-orthogonal part of the neural network simply as optimizer and learning rate.