Lipschitz Recurrent Neural Networks

N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, Michael W. Mahoney

Introduction

Many interesting problems exhibit temporal structures that can be modeled with recurrent neural networks (RNNs), including problems in robotics, system identification, natural language processing, and machine learning control. In contrast to feed-forward neural networks, RNNs consist of one or more recurrent units that are designed to have dynamical (recurrent) properties, thereby enabling them to acquire some form of internal memory. This equips RNNs with the ability to discover and exploit spatiotemporal patterns, such as symmetries and periodic structures (Hinton, 1986). However, RNNs are known to have stability issues and are notoriously difficult to train, most notably due to the vanishing and exploding gradients problem (Bengio et al., 1994; Pascanu et al., 2013).

Several recurrent models deal with the vanishing and exploding gradients issue by restricting the hidden-to-hidden weight matrix to be an element of the orthogonal group (Arjovsky et al., 2016; Wisdom et al., 2016; Mhammedi et al., 2017; Vorontsov et al., 2017; Lezcano-Casado & Martinez-Rubio, 2019). While such an approach is advantageous in maintaining long-range memory, it limits the expressivity of the model. To address this issue, recent work suggested to construct hidden-to-hidden weights which have unit norm eigenvalues and can be nonnormal (Kerg et al., 2019). Another approach for resolving the exploding/vanishing gradient problem has recently been proposed by Kag et al. (2020), who formulate the recurrent units as a differential equation and update the hidden states based on the difference between predicted and previous states.

Treating RNNs as dynamical systems enables studying the long-term behavior of the hidden state with tools from stability analysis. From this point of view, an unstable unit presents an exploding gradient problem, while a stable unit has well-behaved gradients over time (Miller & Hardt, 2019). However, a stable recurrent unit can suffer from vanishing gradients, leading to catastrophic forgetting (Hochreiter & Schmidhuber, 1997b). Thus, we opt for a stable model whose dynamics do not (or only slowly do) decay over time. Importantly, stability is also a statement about the robustness of neural units with respect to input perturbations, i.e., stable models are less sensitive to small perturbations compared to unstable models. Recently, Chang et al. (2019) explored the stability of linearized RNNs and provided a local stability guarantee based on the Jacobian. In contrast, the particular structure of our unit (1) allows us to obtain guarantees of global exponential stability using control theoretical arguments. In turn, the sufficient conditions for global stability motivate a novel symmetric-skew decomposition based scheme for constructing hidden-to-hidden matrices. This scheme alleviates exploding and vanishing gradients, while remaining highly expressive.

In summary, the main contributions of this work are as follows:

First, in Section 3, using control theoretical arguments in a direct Lyapunov approach, we provide sufficient conditions for global exponential stability of the Lipschitz RNN unit (Theorem 1). Global stability is advantageous over local stability results since it guarantees non-exploding gradients regardless of the state. In the special case where $A$ is symmetric, we find that these conditions agree with those in classical theoretical analyses (Lemma 1).

Next, in Section 4, drawing from our stability analysis, we propose a novel scheme based on the symmetric-skew decomposition for constructing hidden-to-hidden matrices. This scheme mitigates the vanishing and exploding gradients problem, while obtaining highly expressive hidden-to-hidden matrices.

In Section 6, we show that our Lipschitz RNN has the ability to outperform state-of-the-art recurrent units on computer vision, language modeling and speech prediction tasks. Further, our results show that the higher-order explicit midpoint time integrator improves the predictive accuracy as compared to using the simpler one-step forward Euler scheme.

Finally, in Section 7), we study our Lipschitz RNN via the lens of the Hessian and show that it is robust with respect to parameter perturbations; we also show that our model is more robust with respect to input perturbations, compared to other continuous-time RNNs.

Related Work

The problem of vanishing and exploding gradients (and stability) have a storied history in the study of RNNs. Below, we summarize two particular approaches to the problem (constructing unitary/orthogonal RNNs and the dynamical systems viewpoint) that have gained significant attention.

Unitary and orthogonal RNNs. Unitary recurrent units have received attention recently, largely due to Arjovsky et al. (2016) showing that unitary hidden-to-hidden matrices alleviate the vanishing and exploding gradients problem. Several other unitary and orthogonal models have also been proposed (Wisdom et al., 2016; Mhammedi et al., 2017; Jing et al., 2017; Vorontsov et al., 2017; Jose et al., 2018). While these approaches stabilize the training process of RNNs considerably, they also limit their expressivity and their prediction accuracy. Further, unitary RNNs are expensive to train, as they typically involve the computation of a matrix inverse at each step of training. Recent work by Lezcano-Casado & Martinez-Rubio (2019) overcame some of these limitations. By leveraging concepts from Riemannian geometry and Lie group theory, their recurrent unit exhibits improved expressivity and predictive accuracy on a range of benchmark tasks while also being efficient to train. Another competitive recurrent design was recently proposed by Kerg et al. (2019). Their approach is based on the Schur decomposition, and it enables the construction of general nonnormal hidden-to-hidden matrices with unit-norm eigenvalues.

Dynamical systems inspired RNNs. The continuous time view of RNNs has a long history in the neurodynamics community as it provides higher flexibility and increased interpretability (Pineda, 1988; Pearlmutter, 1995; Zhang et al., 2014). In particular, RNNs that are governed by differential equations with an additive structure have been extensively studied from a theoretical point of view (Funahashi & Nakamura, 1993; Kim et al., 1996; Chow & Li, 2000; Hu & Wang, 2002; Li et al., 2005; Trischler & D’Eleuterio, 2016). See Zhang et al. (2014) for a comprehensive survey of continuous-time RNNs and their stability properties.

Recently, several works have adopted the dynamical systems perspective to alleviate the challenges of training RNNs which are related to the vanishing and exploding gradients problem. For non-sequential data, Ciccone et al. (2018) proposed a negative-definite parameterization for enforcing stability in the RNN during training. Chang et al. (2019) introduced an antisymmetric hidden-to-hidden weight matrix and provided guarantees for local stability. Kag et al. (2020) have proposed a differential equation based formulation for resolving the exploding/vanishing gradients problem by updating the hidden states based on the difference between predicted and previous states. Niu et al. (2019) employed numerical methods for differential equations to study the stability of RNNs.

Another line of recent work has focused on continuous-time models that deal with irregular sampled time-series, missing values and multidimensional time series. Rubanova et al. (2019) and De Brouwer et al. (2019) formulated novel recurrent models based on the theory of differential equations and their discrete integration. Lechner & Hasani (2020) extended these ordinary differential equation (ODE) based models and addresses the issue of vanishing and exploding gradients by designing an ODE-model that is based on the idea of long short-term memory (LSTM). This ODE-LSTM outperforms the continuous-time LSTM (Mei & Eisner, 2017) as well as the GRU-D model (Che et al., 2018) that is based on a gated recurrent unit (GRU).

The link between dynamical systems and models for forecasting sequential data also provides the opportunity to incorporate physical knowledge into the learning process which improves the generalization performance, robustness, and ability to learn with limited data (Chen et al., 2019).

Stability Analysis of Lipschitz Recurrent Units

One of the key contributions in this work is that we prove that model (1) is globally exponentially stable under some mild conditions on $A$ and $W$ . Namely, for any initial hidden state we can guarantee that our Lipschitz unit converges to an equilibrium if it exists, and therefore, gradients can never explode. We improve upon recent work on stability in recurrent models, which provide only a local analysis, see e.g., (Chang et al., 2019). In fact, global exponential stability is among the strongest notions of stability in nonlinear systems theory, implying all other forms of Lyapunov stability about the equilibrium $h^{*}$ (Khalil, 2002, Definitions 4.4 and 4.5).

The presence of a Lipschitz nonlinearity in (1) plays an important role in our analysis. While we focus on $\tanh$ in our experiments, our proof is more general and is applicable to models whose nonlinearity $\sigma(\cdot)$ is an $M$ -Lipschitz function. Specifically, we consider the general model

for which we have the following stability result. In the following, we let $\sigma_{\min}$ and $\sigma_{\max}$ denote the smallest and largest singular values of the hidden-to-hidden matrices, respectively.

The two cases show that global exponential stability is guaranteed if either (a) the matrix $A$ has eigenvalues with real parts sufficiently negative to counteract expanding trajectories in the nonlinearity; or (b) the nonlinearity is monotone, both $A$ and $W$ yield stable linear systems $\dot{u}=Au$ , $\dot{v}=Wv$ , and $A,W$ have sufficiently similar eigenvectors. In practice, case (b) occasionally holds, but is challenging to ensure without assuming specific structure on $A$ , $W$ . Because such assumptions could limit the expressiveness of the model, the next section will develop a tunable formulation for $A$ and $W$ with the capacity to ensure that case (a) holds.

In Appendix A.1, we provide a proof of Theorem 1 using a direct Lyapunov approach. One advantage of this approach is that the driving input $x$ is permitted to evolve in time arbitrarily in the analysis. The proof relies on the classical Kalman-Yakubovich-Popov lemma and circle criterion from control theory — to our knowledge, these tools have not been applied in the modern RNN literature, and we hope our proof can illustrate their value to the community.

In the special case where $A$ is symmetric and $x(t)$ constant, we show that we can also inherit criteria for both local and global stability from a class of well-studied Cohen–Grossberg–Hopfield models.

For a thorough review of analyses of (5), see (Zhang et al., 2014). In this special case, the criteria in Theorem 1 coincide with those obtained for the corresponding model (5). However, in practice, we will not choose $A$ to be symmetric.

Symmetric-Skew Hidden-to-Hidden Matrices

In this section we propose a novel scheme for constructing hidden-to-hidden matrices. Specifically, based on the successful application of skew-symmetric hidden-to-hidden weights in several recent recurrent architectures, and our stability criteria in Theorem 1, we propose an effective symmetric-skew decomposition for hidden matrices. Our decomposition allows for a simple control of the matrix spectrum while retaining its wide expressive range, enabling us to satisfy the spectral constraints derived in the previous section on both $A$ and $W$ . The proposed scheme also accounts for the issue of vanishing gradients by reducing the magnitude of large negative eigenvalues.

To address the expressivity issue, we aim for hidden matrices which on the one hand, allow to control the expansion and shrinkage of their associated trajectories, and on the other hand, will be sampled from a superset of the skew-symmetric matrices. Our analysis in Theorem 1 guarantees that Lipschitz recurrent units maintain non-expanding trajectories under mild conditions on $A$ and $W$ . Unfortunately, this proposition does not provide any information with respect to the shrinkage of paths. Here, we opt for a system whose expansion and shrinkage can be easily controlled. Formally, the latter requirement is equivalent to designing hidden weights $S$ with small $\mathcal{R}\lambda_{i}(S),\;i=1,2,\dots,N$ , where $\mathcal{R}(z)$ denotes the real part of $z$ . A system of the form (4) whose matrices $A$ and $W$ exhibit small spectra and satisfy the conditions of Theorem 1, will exhibit dynamics with moderate decay and growth behavior and alleviate the problem of exploding and vanishing gradients. To this end, we propose the following symmetric-skew decomposition for constructing hidden matrices:

where $M$ is a weight matrix, and $\beta\in[0.5,1]$ , $\gamma>0$ are tuning parameters. In the case $(\beta,\gamma)=(1,0)$ , we recover a skew-symmetric matrix, i.e., $S_{1,0}+S_{1,0}^{T}=0$ . The construction $S_{\beta,\gamma}$ is useful as we can easily bound its spectrum via the parameters $\beta$ and $\gamma$ , as we show in the next proposition.

A proof is provided in Appendix A.2. We infer that $\beta$ controls the width of the spectrum, while increasing $\gamma$ shifts the spectrum to the left along the real axis, thus enforcing eigenvalues with non-positive real parts. Choosing our hidden-to-hidden matrices to be $A_{\beta_{A},\gamma_{A}}$ and $W_{\beta_{W},\gamma_{W}}$ of the form (6) for different values of $\beta_{A},\beta_{W}$ and $\gamma_{A},\gamma_{W}$ , we can ensure small spectra and satisfy the conditions of Theorem 1 as desired. Note, that different tuning parameters $\beta$ and $\gamma$ affect the stability behavior of the Lipschitz recurrent unit. This is illustrated in Figure 1, where different values for $\beta$ and $\gamma$ are used to construct both $A_{\beta,\gamma}$ and $W_{\beta,\gamma}$ and applied to learning simple pendulum dynamics.

Therefore, provided both the initial and optimal parameters lie within the stability region, the model parameters will remain in the stability region for longer periods of time with high probability as $\beta\to 1$ . Further empirical evidence of parameters often remaining in the stability region during training are provided alongside the proof of Lemma 2 in the Appendix (see Figure 5).

Training Continuous-time Lipschitz Recurrent Units

ODEs such as Eq. (1) can be approximately solved by employing numerical integrators. In scientific computing, numerical integration is a well studied field that provides well understood techniques (LeVeque, 2007). Recent literature has also introduced new approaches which are designed with neural network frameworks in mind (Chen et al., 2018).

We consider both the explicit (forward) Euler scheme,

as well as the midpoint method which is a two-stage explicit Runge-Kutta scheme (RK2),

Empirical Evaluation

In this section, we evaluate the performance of the Lipschitz RNN and compare it to other state-of-the-art methods. The model is applied to ordered and permuted pixel-by-pixel MNIST classification, as well as to audio data using the TIMIT dataset. We show the sensitivity with respect to to random initialization in Appendix B. Appendix B also contains additional results for: pixel-by-pixel CIFAR-10 and a noise-padded version of CIFAR-10; as well as for character level and word level prediction using the Penn Tree Bank (PTB) dataset. All of these tasks require that the recurrent unit learns long-term dependencies: that is, the hidden-to-hidden matrices need to have sufficient memory to remember information from far in the past.

The pixel-by-pixel MNIST task tests long range dependency by sequentially presenting $784$ pixels to the recurrent unit, i.e., the RNN processes one pixel at a time (Le et al., 2015). At the end of the sequence, the learned hidden state is used to predict the class membership probability of the input image. This task requires that the RNN has a sufficient long-term memory in order to discriminate between different classes. A more challenging variation to this task is to operate on a fixed random permutation of the input sequence.

Table 1 provides a summary of our results. The Lipschitz RNN, with hidden dimension of $N=128$ and trained with the forward Euler and RK2 scheme, achieves $99.4\%$ and $99.3\%$ accuracy on the ordered pixel-by-pixel MNIST task. For the permuted task, the model trained with forward Euler achieves $96.3\%$ accuracy, whereas the model trained with RK2 achieves $96.2\%$ accuracy. Hence, our Lipschitz recurrent unit outperforms state-of-the-art RNNs on both tasks and is competitive even when a hidden dimension of $N=64$ is used, however, it can be seen that a larger unit with more capacity is advantageous for the permuted task. Our results show that we significantly outperform the Antisymmetric RNN (Chang et al., 2019) on the ordered tasks, while using fewer weights. That shows that the antisymmetric weight paramterization is limiting the expressivity of the recurrent unit. The exponential RNN is the next most competitive model, yet this model requires a larger hidden-to-hidden unit to perform well on the two considered tasks.

2 TIMIT

Next, we consider the TIMIT dataset (Garofolo, 1993) to study the capabilities of the Lipschitz RNN for speech prediction using audio data. For our experiments, we used the publicly available implementation of this task by Lezcano-Casado & Martinez-Rubio (2019). This implementation applies the preprocessing steps suggested by Wisdom et al. (2016): (i) downsample each audio sequence to 8kHz; (ii) process the downsampled sequences with a short-time Fourier transform using a Hann window of 256 samples and a window hop of 128 samples; and (iii) normalize the log-magnitude of the Fourier amplitudes. We obtain a set of frames that each have 129 complex-valued Fourier amplitudes and the task is to predict the log-magnitude of future frames. To compare our results with those of other models, we used the common train / validation / test split: 3690 utterances from 462 speakers for training, 192 utterances for validation, and 400 utterances for testing.

Table 2 lists the results for the Lipschitz recurrent unit as well as for several benchmark models. It can be seen that the Lipschitz RNN outperforms other state-of-the-art models for a fixed number of parameters ( $\approx 200$ K). In particular, LSTMs do not perform well on this task, however, the recently proposed momentum based LSTMs (Nguyen et al., 2020) have improvemed performance. Interestingly, the RK2 scheme leads to a better performance since this scheme provides more accurate approximations for the intermediate states.

Robustness with Respect to Perturbations

Eigenanalysis of the Hessian provides a tool for studying various aspects of neural networks (Hochreiter & Schmidhuber, 1997a; Sagun et al., 2017; Ghorbani et al., 2019). Here, we study the Hessian $H$ spectrum with respect to the model parameters of the recurrent unit using PyHessian (Yao et al., 2019). The Hessian provides us with insights about the curvature of the loss function $\mathcal{L}$ . This is because the Hessian is defined as the derivatives of the gradients, and thus the Hessian eigenvalues describe the change in the gradient of $\mathcal{L}$ as we take an infinitesimal step into a given direction. The eigenvectors span the (local) surface of the loss function at a given point, and the corresponding eigenvalue determines the curvature in the direction of the eigenvectors. This means that larger eigenvalues indicate a larger curvature, i.e., greater sensitivity, and the sign of the eigenvalues determines whether the curvature will be positive or negative.

To demonstrate the advantage of the additional linear term and our weight parameterization, we compare the Lipschitz RNN to two other continuous-time recurrent units. First, we consider a simple neural ODE RNN (Rubanova et al., 2019) that takes the form

where $W$ is a simple hidden-to-hidden matrix. As a second model we consider the antisymmetric RNN (Chang et al., 2019), that takes the same form as (11), but uses a skew-symmetric scheme to parameterize the hidden-to-hidden matrix as $W:=(M-M^{T})-\gamma I$ , where $M$ is a trainable weight matrix and $\gamma$ is a tunable parameter.

Further, Figure 2 shows the results for white noise and salt and pepper noise. It can be seen that the Lipschitz unit is less sensitive to input perturbations, as compared to the simple neural ODE RNN, and the antisymmetric RNN. In addition, we also show the results for an unitary RNN here.

The performance of the Lipschitz recurrent unit is due to two main innovations: (i) the additional linear term; and (ii) the scheme for constructing the hidden-to-hidden matrices $A$ and $W$ in Eq. (6). Thus, we investigate the effect of both innovations, while keeping all other conditions fixed. More concretely, we consider the following ablation recurrent unit

where $\alpha$ controls the effect of the linear hidden unit. Both $A$ and $W$ depend on the parameters $\beta$ , $\gamma$ .

Figure 3(a) studies the effect of the linear hidden unit, with $\beta=0.65$ for the ordered task and $\beta=0.8$ for the permuted task. In both cases we use $\gamma=0.001$ . It can be seen that the test accuracies of both the ordered and permuted pixel-by-pixel MNIST tasks clearly depend on the linear hidden unit. For $\alpha=0$ , our models reduces to simple neural ODE recurrent units (Eq. (11)). The recurrent unit degenerates for $\alpha>1.6$ , since the external input is superimposed by the hidden state. Figure 3(b) studies the effect of the hidden-to-hidden matrices with respect to $\beta$ . It can be seen that $\beta=\{0.65,0.70\}$ achieves peak performance for the ordered task, and $\beta=\{0.8,0.85\}$ does so for the permuted task. Note that $\beta=1.0$ recovers an skew-symmetric hidden-to-hidden matrix.

Conclusion

Viewing RNNs as continuous-time dynamical systems with input, we have proposed a new Lipschitz recurrent unit that excels on a range of benchmark tasks. The special structure of the recurrent unit allows us to obtain guarantees of global exponential stability using control theoretical arguments. In turn, the insights from this analysis motivated the symmetric-skew decomposition scheme for constructing hidden-to-hidden matrices, which mitigates the vanishing and exploding gradients problem. Due to the nice stability properties of the Lipschitz recurrent unit, we also obtain a model that is more robust with respect to input and parameter perturbations as compared to other continuous-time units. This behavior is also reflected by the Hessian analysis of the model. We expect that the improved robustness will make Lipschitz RNNs more reliable for sensitive applications. The theoretical results for our symmetric-skew decomposition of parameterizing hidden-to-hidden matrices also directly extend to the convolutional setting. Future work will explore this extension and study the potential advantages of these more parsimonious hidden-to-hidden matrices in combination with our parameterization in practice. Research code is shared via github.com/erichson/LipschitzRNN.

We would like to thank Ed H. Chi for fruitful discussions about physics-informed machine learning and the Antisymmetric RNN. We are grateful to the generous support from Amazon AWS and Google Cloud. NBE and MWM would like to acknowledge IARPA (contract W911NF20C0035), NSF, ONR and CLTC for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

References

Appendix A Proofs

There are numerous ways that one can analyze the global stability of (4) through the related model (5), many of which are discussed in Zhang et al. (2014). Instead, here we shall conduct a direct approach and avoid appealing to diagonalization in order to obtain cleaner conditions, and a more straightforward proof that readily applies in the time-inhomogeneous setting.

Our method of choice relies on Lyapunov arguments summarized in the following theorem, which can be found as (Khalil, 2002, Theorem 4.10). For more details on related Lyapunov theory, see also Hahn (1967); Sastry (2013).

for some constants $k_{1},k_{2},k_{3},\alpha>0$ . and $\dot{V}(h)<0$ for $h\neq h^{\ast}$ .

The construction of the Lyapunov function $V$ that satisfies the conditions of Theorem 2 is accomplished using the Kalman-Yakubovich-Popov lemma, which is a statement regarding strictly positive real transfer functions. We use the following definition, equivalent to other standard definitions by (Khalil, 2002, Lemma 6.1).

The poles of $G(s)$ have negative real parts.

The following is presented in (Khalil, 2002, Lemma 6.3).

if and only if the transfer function $G(s)=W(sI-A)^{-1}$ is strictly positive real. In this case, we may take $\epsilon=2\mu$ , where $\mu>0$ is chosen so that $G(s-\mu)$ remains strictly positive real.

A shorter proof for case (a) is available to us through the (multivariable) circle criterion — the following theorem is a corollary of (Khalil, 2002, Theorem 7.1) suitable for our purposes.

is globally exponentially stable towards an equilibrium at the origin if $\|\psi(t,y)\|\leq M\|y\|$ for some $M>0$ and $Z(s)=[I+MG(s)][I-MG(s)]^{-1}$ is strictly positive real, where $G(s)=W(sI-A)^{-1}$ .

Both the Kalman-Yakubovich-Popov lemma and the circle criterion are classical results in control theory, and are typically discussed in the setting of feedback systems (Khalil, 2002, Chapter 6, 7). Our presentation here is less general than the complete formulation, but makes clearer the connection to RNNs. With these tools, we state our proof of Theorem 1.

To begin, we shall center the differential equation about the equilibrium. By assumption, there exists $h^{\ast}$ such that $Ah^{\ast}=-\sigma(Wh^{\ast}+Ux(t)+b)$ . Letting $\bar{h}=h-h^{\ast}$ , we find that

It will suffice to show that (13) is globally exponentially stable at the origin.

Let us begin with case (a). The proof follows arguments analogous to (Khalil, 2002, Example 7.1). Let $G(s)=W(A-sI)^{-1}$ denote the transfer function for the system (13). Letting

From the Fan-Hoffman inequality (Bhatia, 2013, Proposition III.5.1), we have that

and since $A+A^{T}$ is negative definite, for any $s$ with $\Re(s)\geq 0$ ,

by assumption. Finally, since $Z(\infty)+Z(\infty)^{T}=2I$ is positive definite, $Z(s)$ is strictly positive real and Theorem 3 applies.

Since the inner matrix factor is Hermitian, Sylvester’s law of inertia implies that $G(i\omega)+G(-i\omega)^{T}$ is positive definite if and only if

is positive definite. Since $B_{\omega}$ is a Hermitian matrix, it has real eigenvalues, with minimal eigenvalue given by the infimum of the Rayleigh quotient:

By assumption, $A^{T}W+W^{T}A$ has strictly positive eigenvalues, and hence $B_{\omega}$ and $G(i\omega)+G(-i\omega)^{T}$ are positive definite. Therefore, Lemma 3 applies, and we obtain matrices $P,L,U$ and a constant $\epsilon>0$ with the corresponding properties.

Now we may construct our Lyapunov function $V$ . Let $v=W\bar{h}$ and

so that $\dot{\bar{h}}=A\bar{h}+u$ . Since $\sigma$ is monotone non-decreasing, $\sigma(x)-\sigma(y)\geq 0$ for any $x\geq y$ . This implies that for each $i=1,\dots,N$ , $v_{i}$ and $u_{i}$ have the same sign. In particular, $v^{T}u\geq 0$ . Now, let $V(h)=h^{T}Ph$ be our Lyapunov function, noting that $V$ is independent of $t$ . Taking the derivative of the Lyapunov function over (13) and using the properties of $P,L,U,\epsilon$ ,

Since $v^{T}u\geq 0$ and $(L\bar{h}+Uu)^{T}(L\bar{h}+Uu)\geq 0$ , it follows that $\dot{V}(\bar{h})\leq-\epsilon\lambda_{\min}(P)\|h\|^{2}$ , and hence global exponential stability follows from Theorem 2 and positive-definiteness of $P$ . ∎

To finish off discussion regarding the results from Sec. 3, we provide a quick proof of Lemma 1 using a simple diagonalization argument.

Since $A$ is symmetric and real-valued, by (Horn & Johnson, 2012, Theorem 4.1.5), there exists an orthogonal matrix $P$ and a real diagonal matrix $D$ such that $A=PDP^{T}$ . Letting $z=P^{T}h$ where $h$ satisfies (4), since $h=Pz$ , we see that

Therefore, $z$ satisfies (5) with $L=P^{T}$ and $V=WP$ , both of which are nonsingular by orthogonality of $P$ . By the same argument, for any equilibrium $h^{\ast}$ , taking $z^{\ast}=P^{T}h^{\ast}$ ,

implying that $z^{\ast}$ is an equilibrium of (5). Furthermore, since

from orthogonality of $P$ . Because every form of Lyapunov stability, both local and global, including global exponential stability, depend only on the norm $\|h-h^{\ast}\|$ (Khalil, 2002, Definitions 4.4 and 4.5), $h^{\ast}$ is stable under any of these forms if and only if $z^{\ast}$ is also stable. ∎

We remark that the proof of Lemma 1 can extend to matrices $A$ which have real eigenvalues and are diagonalizable. These attributes are implied for symmetric matrices. However, they can be difficult to ensure in practice for nonsymmetric matrices without imposing difficult structural constraints.

A.2 Proof of Proposition 1

The proof of Proposition 1 relies on the following lemma, which we also have made use of several times throughout this work.

Recall by the min-max theorem, for $\langle u,v\rangle=u^{\ast}v$ , where $u^{\ast}$ is the conjugate transpose of $u$ , the upper and lower eigenvalues of $A+A^{T}$ satisfy

Let $\lambda_{i}(A)=u+i\omega$ be an eigenvalue of $A$ with corresponding eigenvector $v$ satisfying $\|v\|=1$ . Since $Av=(u+i\omega)v$ ,

Hence, $\lambda_{\min}(A+A^{T})\leq u\leq\lambda_{\max}(A+A^{T})$ . ∎

as required. Finally, if $\beta=1$ , then (16) still holds, since both intervals collapse to the single point $\{-\gamma\}$ . ∎

Figure 4 illustrates the effect of $\beta$ onto the eigenvalues of $A_{\beta,\gamma}$ with the largest and smallest real parts. It can be seen, both empirically and theoretically, that the real part of the eigenvalues converges towards zero as $\beta$ tends towards one, i.e., we yield a skew-symmetric matrix with purely imaginary eigenvalues in the limit. Thus, for a sufficiently large parameter $\beta$ we yield a system that approximately preserves an “energy” for a limited time-horizon

A.3 Proof of Lemma 2

First, it follows from Gronwall’s inequality that the norm of the final hidden state $\|h(T)\|$ is bounded uniformly in $\beta$ . From Weyl’s inequalities and the definition of $A_{\beta,\gamma}$ ,

By the chain rule, for each element $M_{A}^{ij}$ of the matrix $M_{A}$ ,

Now, for any collection of parameters $\theta_{i}$ ,

Since $\frac{\partial(M_{A}h)}{\partial M_{A}^{ij}}=\frac{\partial(M_{A}^{T}h)}{\partial M_{A}^{ji}}$ , it follows that

In Figure 5, we plot the most positive real part of the eigenvalues of $A_{\beta,\gamma}$ and $W_{\beta,\gamma}$ during training for the ordered MNIST task. As $\beta$ increases, the eigenvalues change less during training, remaining in the stability region provided by case (b) of Theorem 1 for more of the training time.

Appendix B Additional Experiments

The hidden matrices are initialized by sampling weights from the normal distribution $\mathcal{N}(0,\sigma)$ , where $\sigma$ is the variance, which can be treated as a tuning parameter. In our experiments we typically chose a small $\sigma$ ; see the Table 8 for details. To show that the Lipschitz RNN is insensitive to random initialization, we have trained each model with 10 different seeds. Table 4 shows the maximum, average and minimum values obtained for each task. Note that higher values indicate better performance on the ordered and permuted MNIST tasks, while lower values indicate better performance on the TIMIT task.

B.2 Ordered Pixel-by-Pixel and Noise-Padded CIFAR-10

The pixel-by-pixel CIFAR-10 benchmark problem that has recently been proposed by (Chang et al., 2019). This task is similar to the pixel-by-pixel MNIST task, yet more challenging due to the increased sequence length and the more difficult classification problem. Similar to MNIST, we flatten the CIFAR-10 images to construct a sequence of length $1024$ in scanline order, where each element of the sequence consists of three pixels (one from each channel).

Table 5 provides a summary of our results. Our Lipschitz recurrent unit outperforms both the incremental RNN (Kag et al., 2020) and the antisymmetric RNN (Chang et al., 2019) by a significant margin. This impressively demonstrates that the Lipschitz unit enables the stable propagation of signals over long time horizons.

B.3 Penn Tree Bank (PTB)

Next, we consider a character level language modeling task using the Penn Treebank Corpus (PTB) (Marcus et al., 1993). Specifically, this task studies how well a model can predict the next character in a sequence of text. The dataset is composed of a train / validation / test set, where $5017$ K characters are used for training, $393$ K characters are used for validation and $442$ K characters are used for testing. For our experiments, we used the publicly available implementation of this task by Kerg et al. (2019), which computes the performance in terms of mean bits per character (BPC).

Table 6 shows the results for back-propagation through time (BPTT) over 150 and 300 time steps, respectively. The Lipschitz RNN performs slightly better then the exponential RNN and the non-normal RNN on this task. (Kerg et al., 2019) notes that orthogonal hidden-to-hidden matrices are not particular well-suited for this task. Thus, it is not surprising that the Lipschitz unit has a small advantage here.

For comparison, we have also tested the Antisymmetric RNN (Chang et al., 2019) on this task. The performance of this unit is considerably weaker as compared to our Lipschitz unit. This suggests that the Lipschitz RNN is more expressive and improves the propagation of meaningful signals over longer time scales.

B.3.2 Word-Level Prediction

In addition to character-level prediction, we also consider word-level prediction using the PTB corpus. For comparison with other state-of-the-art units, we consider the setup by Kusupati et al. (2018), who use a sequence length of $300$ . Table 7 shows results for back-propagation through time (BPTT) over 300 time steps. The Lipschitz RNN performs slightly better than the other RNNs on this task and the baseline LSTM for the test perplexity metric reported by Kusupati et al. (2018).

Appendix C Tuning Parameters

For tuning we utilized a standard training procedure using a non-exhaustive random search within the following plausible ranges for the our weight parameterization $\beta={0.65,0.7,0.75,0.8}$ , $\gamma=[0.001,1.0]$ . For Adam we explored learning rates between 0.001 and 0.005, and for SGD we considered 0.1. For the step size we explored values in the range 0.001 to 1.0. We did not perform an automated grid search and thus expect that the models can be further fine-tuned.

The tuning parameters for the different tasks that we have considered are summarized in Table 8.

For pixel-by-pixel MNIST and CIFAR-10, we use Adam for minimizing the objective. We train all our models for $100$ epochs, with scheduled learning rate decays at epochs $\{90\}$ . We do not use gradient clipping during training. Figure 6 shows the test accuracy curves for our Lipschitz RNN for the ordered and permuted MNIST classification tasks.

For TIMIT we use Adam with default parameters for minimizing the objective. We also tried Adam using betas (0.0, 0.9) as well as RMSprop with $\alpha=0.9$ , however, Adam with default values worked best in our experiments. We train the model for $1200$ epochs without learning-rate decay. Similar to Kerg et al. (2019) we train our model with gradient clipping, however, we observed that the performance of our model is relatively insensitive to the clipping value.

For the character level prediction task, we use Adam with default parameters for minimizing the objective, while we use RMSprop with $\alpha=0.9$ for the word level prediction task. We train the model for $200$ epochs for the character-level task, and for 500 epochs for the word-level task.