Resurrecting Recurrent Neural Networks for Long Sequences

Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, Soham De

Introduction

Recurrent neural networks (RNNs) have played a central role since the early days of deep learning, and are a natural choice when modelling sequential data (McCulloch and Pitts, 1943; Hopfield, 1982; Rumelhart et al., 1985; Elman, 1990). However, while these networks have strong theoretical properties, such as Turing completeness (Kilian and Siegelmann, 1996; Chung and Siegelmann, 2021), it is well-known that they can be hard to train in practice. In particular, RNNs suffer from the vanishing and exploding gradient problem (Hochreiter, 1991; Bengio et al., 1994; Pascanu et al., 2013), which makes it difficult for these models to learn about the long-range dependencies in the data. Several techniques were developed that attempt to mitigate this issue, including orthogonal/unitary RNNs (Arjovsky et al., 2016; Helfrich et al., 2018), and gating mechanisms such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRUs) (Cho et al., 2014a). Nonetheless, these models are still slow to optimize due to the inherently sequential nature of their computation (Kalchbrenner et al., 2016), and are therefore hard to scale.

In recent years, Transformers (Vaswani et al., 2017) have gained increasing prominence for sequence modelling tasks, achieving remarkable success in a wide range of applications (Brown et al., 2020; Dosovitskiy et al., 2020; Jumper et al., 2021). Compared to RNNs, attention layers are easier to scale and parallelize during training, and crucially they do not suffer from the vanishing gradient problem, since the interaction between any two tokens in the sequence is modeled by direct edges in the network. A key issue with attention layers however is that their computational and memory costs scale quadratically as O(L2)O(L^{2}) with the sequence length LL. Transformers can therefore be especially expensive to deploy on long sequences. RNNs, which scale linearly with the sequence length, are therefore typically faster than transformers at inference time even for modest sequence lengths (Liu et al., 2019).

Motivated by these problems, Gu et al. (2021a) recently introduced the S4 model, a carefully designed deep state-space model (SSM) achieving remarkable performance on tasks from the Long Range Arena (LRA) (Tay et al., 2020), a benchmark explicitly designed to require very long-ranged reasoning. S4 is theoretically principled and inspired by continuous-time linear SSMs; well-established components of modern control systems. More importantly, the S4 layer and its variants (DSS, S4D, S5, etc) (Gupta et al., 2022a; Gu et al., 2022a; Smith et al., 2022) overcome the O(L2)O(L^{2}) bottleneck of attention layers by modeling interactions between tokens using a hidden state (like RNNs) under proper discretization techniques. These models can be made very efficient at inference time by simply unrolling the layer like an RNN. Futhermore, since SSMs are linear in the temporal dimension, they are easily parallelizable during training, in contrast to the slow sequential nature of training a typical RNN. This makes them very computationally efficient on long sequences.

While the S4 model is equivalent to an RNN during inference, it has a number of unique characteristics during training. For example, S4 is parameterized as a discretization of a latent continuous-time system of differential equations. S4 also uses specific initializations of the state matrices motivated from the theory of polynomial projections (Gu et al., 2020). While these characteristics might seem to motivate the impressive performance of these models, later works (Gu et al., 2022a; Smith et al., 2022; Gupta et al., 2022a, b) have suggested that the specific initialization used by S4 is often not crucial for performance, and that the discretization rules which achieve best performance may deviate from theory (Smith et al., 2022). It is therefore unclear what these unique characteristics of the deep SSMs are doing mechanistically, and how they can be simplified.

Motivated by the striking similarities between RNNs and deep SSMs, and in an attempt to better understand the underlying mechanism driving the performance of these models, we study the power and limitations of RNNs when used as core components of deep architectures for long-range reasoning. Our main goal is to answer the question:

“Can we match the performance and efficiency of deep continuous-time SSMs using deep RNNs?”

We give a positive answer to this question. We show that the performance boost provided by deep SSMs like S4 can also be achieved via a series of small changes to a vanilla deep RNN. With these changes, we can recover the performance and efficiency of these deep SSMs on the Long Range Arena (LRA) benchmark (Tay et al., 2020). We call this new RNN model the Linear Recurrent Unit (or LRU for short).

We outline here the main steps needed towards crafting performant and efficient RNN models. Note while some of these observations have been made in prior works (see §B), we provide novel perspectives and careful ablations leading to new insights. Each step presented in this paper unveils a specific property of recurrent networks, and showcases the challenges and best practices in training and initializing deep RNNs.

Linear Recurrences. When replacing SSM layers in a deep architecture with vanilla RNN layers using tanh or ReLU activations, the performance on Long Range Arena (LRA) drops significantly. Surprisingly, in §3.1 we find that simply removing the nonlinearities in the recurrence of the RNN (i.e., using linear recurrences) gives a substantial boost in test accuracy. We motivate this effect in §E.1 by showing that stacking linear RNN layers and nonlinear MLP blocks (Fig.1) can indeed model complex nonlinear sequence-to-sequence maps without the need for nonlinearities in the recurrence. While dropping the nonlinearity does not seem to harm expressivity, it leads to several advantages, from the ability to directly control how quickly the gradients might vanish or explode, to allowing us to parallelize training. Our findings also partially motivate the success of deep SSMs, where the recurrence is also linear.

Complex Diagonal Recurrent Matrices. Dense linear RNN layers can be reparameterized to a complex diagonal form without affecting the expressivity of the network or the features at initialization (§3.2). Diagonal linear RNN layers additionally allow for a highly parallelizable unrolling of the recurrence using parallel scans to substantially improve training speeds (Martin and Cundy, 2017). We validate that these observations, which have been leveraged by prior SSMs (Gupta et al., 2022a; Smith et al., 2022), also provide important efficiency improvements for linear RNN layers.

Stable Exponential Parameterization. In §3.3 we show that using an exponential parameterization for the diagonal recurrent matrix has important benefits. Crucially, this enables us to easily enforce stability during training, which in turn allows us to modify the initialization distribution to facilitate long-range reasoning and improve performance. Our results indicate that rather than the specific deterministic initializations used by several recent SSMs, it is the eigenvalue distribution of the recurrent layer at initialization that determines if the model can capture long-range reasoning.

Normalization. In §3.4 we show that normalizing the hidden activations on the forward pass is important when learning tasks with very long-range dependencies. With this final modification, our RNNs can match the performance of deep SSMs on all tasks in the LRA benchmark. Connecting back to state-space models, we show in §4 how our normalization can be linked to the discretization structure in S4.

We summarize the deep Linear Recurrent Unit (LRU) architecture used in this paper, and the effect of each of the above steps on performance in Fig.1. We emphasize that the main purpose of our work is not to surpass the performance of S4-based models, but rather to demonstrate that simple RNNs can also achieve strong performance on long range reasoning tasks when properly initialized and parameterized. We believe the insights derived in this paper can be useful to design future architectures, and to simplify existing ones.

Preliminaries

In this section, we compare the key architectural components (RNNs and SSMs) studied in this work, and also describe our methodology and experimental setup. For a more thorough discussion or related architectures, the reader can check our related work section §B.

We give an overview of the main architectural components considered in this paper, focusing on the major difference between Vanilla RNNs and recent S4-like deep SSMs (Gu et al., 2021a, 2022a; Gupta et al., 2022a; Smith et al., 2022).

For training and inference, the continuous-time system in Eq.(2) is discretized at stepsize Δ\Delta through a high-accuracy Zero-Order-Hold (ZOH) or Bilinear method. The ZOH method gives

It is worth pointing out a few structural and computational properties, to highlight some crucial differences between RNNs and SSMs:

Since Eq.(3) is linear, it can be efficiently parallelized until k=L1k=L-1 using parallel scans (Martin and Cundy, 2017; Smith et al., 2022), unlike a nonlinear RNN where the computation has to be performed sequentially.

Unlike vanilla RNNs, most SSMs use complex-valued diagonal recurrent matrices that are initialized deterministically using HiPPO theory, and the literature attributes much of the success of SSMs to the specific initialized used (Gu et al., 2021a; Gupta et al., 2022a; Gu et al., 2022b).

The points above motivate our investigation: in this paper we consider the same architecture as Gu et al. (2021a, 2022a); Smith et al. (2022), but replace the SSM layer in the recurrent core by an RNN. We then study which steps need to be taken to gradually retrieve S4-like performance on LRA (Tay et al., 2020) tasks. The effectiveness of each of our steps is supported by empirical evidence and theoretical considerations, and leads to the architecture presented in Fig.1.

2 Experimental setup

In this paper, we consider the Long Range Arena benchmark (Tay et al., 2020), a set of tasks designed to test the ability of models to do long-range sequence modelling (except we use coloured images instead of grayscale images for the sequential CIFAR-10 classification task). Transformers fail to perform well on most of these tasks, while deep SSMs have shown remarkable performance on these tasks (Gu et al., 2021a; Dao et al., 2022a). This makes it an appropriate benchmark to explore the long-range modelling capabilities of deep RNNs.

For all our experiments, we use a network of 6 layers with residual connections and layer/batch normalization (Ioffe and Szegedy, 2015; Ba et al., 2016) similar to Gu et al. (2021a) (Fig.1), and we replace the SSM layers with RNN layers, building up to our LRU recurrence in a sequence of steps (see §3). All experiments are repeated three times, and we report the mean and standard error. Networks are trained using the AdamW optimizer (Loshchilov and Hutter, 2017). We use a smaller learning rate and no weight decay on the recurrent parameters, as suggested by Steil (2004); Gu et al. (2021a). We tune hyperparameters such as learning rates for all models on a logarithmic grid for best accuracy. See §D for more details on our experimental setup.

Designing Performant Deep RNNs

In this section, we discuss the fundamental steps needed for designing RNNs to reach the impressive performance of deep SSMs on the LRA benchmark. We present these steps, already outlined in the introduction, in logical order, and support each claim with experimental evidence and theoretical considerations, expanded in §E.

We consider the architecture of Fig.1, where the recurrent computation is gradually modified starting from a vanilla RNN. We start by showcasing the advantage of using linear recurrences in §3.1; then, in §3.2, we show how to speed-up training and inference without affecting expressivity and initialization distribution. In §3.3, we discuss how (and why) changing the parameterization and initialization distribution enables us to make the RNN stable and improve long-range modeling. Finally, in §3.4, we finalize the LRU architecture by proposing a normalization strategy for the hidden activations that results in a close match in performance with deep SSMs.

One of the main findings of our work is that linear RNN layers can be surprisingly expressive when coupled with nonlinear MLP or GLU (Dauphin et al., 2017) blocks, outperforming tuned nonlinear RNN variants in the same architecture. In Tb.1, we show that simply removingAll other settings in the recurrent block are kept the same as in the Vanilla RNN module of Haiku (Hennigan et al., 2020). That is, all matrices have Glorot (Glorot and Bengio, 2010) initialization. The rest of the architecture is kept as in Fig.1, where the LRU block is replaced by an RNN. the nonlinearity, and therefore computing the next state as xk=Axk1+Bukx_{k}=Ax_{k-1}+Bu_{k}, is able to improve test accuracy on most LRA tasks. While the boost provided by vanilla linear RNN blocks leads to performance which is still far behind S4 on some tasks (sCIFAR, PathFinder and PathX), this first finding motivates us to drop nonlinearities in the recurrence for the rest of this paper. In later sections, we leverage the linearity of the recurrence to significantly speed up training as well as derive principled initialization and normalization principles to learn long-range dependencies. We note that, on the Text and Retrieval tasks, performance using vanilla RNNs already matches performance of deep SSMs (see Tb.3 for the performance of S4D/S5 on these tasks).

The empirical result in Tb.1 is surprising, since recurrent nonlinearities are believed to be a key component for the success of RNNs — both in the theory and in practice (Siegelmann, 2012; Pascanu et al., 2013; Erichson et al., 2021). Indeed, a strong property of single-layer sigmoidal and tanh\tanh RNNs is Turing completeness, which cannot be achieved by the linear variant (Chung and Siegelmann, 2021). However, the architecture we use (Fig.1) is deeper than a standard RNN and includes nonlinearies, placed position-wise after each RNN block. In §E.1, we investigate how the expressivity and trainability of deep models is affected by recurrent nonlinearities. Leveraging a spectral analysis and Koopman operator theory (Koopman and Neumann, 1932), we discuss how interleaving linear RNN layers with nonlinear feedforward blocks is sufficient to approximate highly nonlinear systems. A key observation in our analysis is that position-wise nonlinearities effectively transfer signal information to higher frequencies, enabling the system to go beyond linearity in the spectral domain and increasing the layer capacity. To further strengthen our claim on the advantage of linear recurrences, in §E.2 we show that, while linear and nonlinear RNNs share an important class of approximating functionals (linear operators, see Wang et al. (2022)), nonlinear activations can potentially slow down training.

2 Using complex diagonal recurrent matrices is efficient

We now show that we can significantly speed up training and inference for deep linear RNNs without losing performance by using complex-valued diagonal recurrent matrices. While the idea of diagonalizing linear systems for computational efficiency is a dominating feature of all deep SSMs since the introduction of DSS by Gupta et al. (2022a), in this section we construct our diagonalized version to exactly match the initialization spectrum (see §3.2.1) of the Glorot-initialized deep linear RNN in Tb.1. Our main purpose with this approach is to disentangle the effects of initialization and diagonalization on performance (cf. Tb.2 and Tb.3).

We start in §3.2.1 by recalling some useful linear algebra elements, and then proceed in §3.2.2 with a discussion on how to diagonalize the recurrence while preserving the eigenvalue spectrum at initialization.

We adopt complex numbers since they provide a convenient and compact representation of non-symmetric matrices in diagonal form. However this is not the only option – one could work (almost) as efficiently using real numbers. We discuss how this can be achieved in §E.3.

Since xˉk=j=0k1ΛjBˉukj\bar{x}_{k}=\sum_{j=0}^{k-1}\Lambda^{j}\bar{B}u_{k-j}, the norm of component jj of xˉ\bar{x} at timestamp kk evolves such that xk,j=O(xˉk,j)=O(λjk)|x_{k,j}|=O(|\bar{x}_{k,j}|)=O(|\lambda_{j}|^{k}). Therefore, a sufficient condition to ensure stability (i.e. xkx_{k} does not explode) is therefore λj<1|\lambda_{j}|<1 for all jj (Gu et al., 2021a).

2.2 Learning in the diagonalized space

Learning recurrent linear systems in diagonal form provides substantial computational speedups both for training and inference. For example, in our implementation of sCIFAR, we found diagonal linear RNNs to be \sim8 times faster to train than a dense RNN with ReLUs, matching the speed of our implementations of S4D and S5. The main reasons for this computational benefit are that (a) taking powers of diagonal matrices is trivial (speeding up both training and inference), while exponentiating dense matrices is computationally expensive, and (b) while nonlinear recurrences must be computed sequentially, unrolling a linear recurrence can be parallelized using associative scans resulting in faster training (Smith et al., 2022; Gupta et al., 2022a).

To disentangle the benefits of diagonal linear systems from the role of initialization, we seek an initialization for the diagonal system which keeps the eigenvalue spectrum of the recurrence unchanged when comparing our diagonal system with the dense linear RNN in §3.1, where AA followed Glorot initialization. Fortunately, we can use a classical result from random matrix theory (Ginibre, 1965).

Our results in Tb.2 show that diagonalizing the recurrence surprisingly improves accuracy on tasks like ListOps and sCIFAR. More importantly, it drastically reduces training and inference time on all LRA tasks (see Tb.4 in §C.1 for training speed comparisons), and makes the RNN just as fast to train as deep SSMs like S4D and S5.

3 Benefits of stable exponential parameterization

In §3.2 we showed that moving to complex diagonal recurrences is computationally efficient. However we also observed that learning the diagonal model can be more unstable than learning the dense model in some experiments. To learn long-range dependencies and avoid quickly vanishing gradients, eigenvalues in the recurrence need to have magnitude close to 1 (Gu et al., 2022b; Gupta et al., 2022a); however, these eigenvalues are also likely to make the system unstable during training. In this section, we show the benefits of a stable parameterization of the RNN, and of tuning rminr_{\min} and rmaxr_{\max} (see Lemma 2).

The benefits of the stable parameterization becomes more apparent when we explore the idea of initializing the eigenvalues of Λ\Lambda on a ring closer to the unit disk (increasing rminr_{\min} closer to 11 in Lemma 2) to bias the network towards longer-range interactions and avoid vanishing gradients. Indeed, as discussed in detail in Gu et al. (2022b); Gupta et al. (2022a), for reasonings requiring consideration of interactions between distant tokens, eigenvalues in the recurrence need to have magnitude close to 11. Otherwise, as clear from the diagonal version of Eq.(4), when taking powers of eigenvalues close to the origin, the signal from past tokens quickly dies out (see §3.2.1). As we show in the last row of Tb.5 in §C, without enforcing stability, performance starts to degrade as we increase rmaxr_{\max} past 0.9 in the sCIFAR task. With stability enforced, we can increase rmaxr_{\max} up to 0.99 and improve performance. We see similar benefits on the other tasks where we sweep different values of rminr_{\min} and rmaxr_{\max} (Tbs.7 & 8 have more details). Finally, note that while here we explore changing the magnitude of the eigenvalues of Λ\Lambda, in §3.4 we also show the benefits of initializing the eigenvalues to have a small phase to learn more global patterns, useful for particularly long-range reasoning tasks.

4 Additional considerations for long-range reasoning tasks

Up to this point, our model did not succeed in learning PathX — the hardest dataset in our benchmark, with a sequence length of 16k16k tokens. In this section, we discuss the additional modifications we need to make to improve our model’s ability to learn very long-range dependencies and finalize our LRU model.

In §3.3, we initialized the eigenvalues of Λ\Lambda close to the unit disk for better performance on long-range tasks. However, we observed that as we moved rminr_{\min} and rmaxr_{\max} closer to 1, the training loss also started to blow up at initialization (see Fig.5). In this section, we first present a result explaining this phenomenon, before deriving a practical normalization scheme for the hidden activations to tackle this problem and further improve performance.

This result has the following intuitive form if rmin=rmax=rr_{\min}=r_{\max}=r: if we initialize ρ\rho-close to the unit disk, the forward pass blows up by a factor 1/ρ1/\rho (since the contributions from previous states take longer to decay): let ϵ=rmax2rmin2\epsilon=r_{\max}^{2}-r_{\min}^{2} and ρ=1rmax2\rho=1-r_{\max}^{2}, then:

where \odot denotes the element-wise product. The γ\gamma parameter allows the RNN to adaptively scale the input fed into the corresponding eigendirection. We found the γ\gamma normalization to consistently improve performance on tasks that benefit from initializing close to the unit disk, such as sCIFAR and Pathfinder, as shown in Tb.3.

Therefore, we claim that initializing Λ\Lambda with uniform phase on long sequence data inherently biases the network towards learning spurious features in the input sequence. The model cannot recover from this suboptimal initialization: we indeed observe that, for our best to far model on PathX, the training loss after a few iterations converges to a highly suboptimal minimizer which leads to random chance test performance (see Fig.5). To fix this issue, we found it sufficient to restrict the range of θ\theta to a thin slice around , biasing the model towards learning more global features. Since the optimal values of θ\theta are small, we parameterize the phase logarithmically: θ=exp(θlog)\theta=\exp(\theta^{\log}), where θlog\theta^{\log} is optimized, to aid optimization.

Restricting the range of the phase at initialization to be [0,π/10][0,\pi/10], our LRU achieved 94.2%94.2\% on PathX, aligning with state-of-the-art deep SSMs. We did not explore using a smaller phase at initialization for the other LRA tasks, although we believe this might further improve performance on other tasks as well. Note that using both γ\gamma normalization and restricting the eigenvalue phase at initialization were crucial to solving PathX. We were unable to learn when using restricted phase at initialization without also introducing γ\gamma normalization.

With all the components of §3 taken together, we name this new model the Linear Recurrent Unit (or LRU for short). It provides a flexible, interpretable, and principled framework for initializing and learning deep RNNs efficiently, and matches performance and efficiency of deep SSMs across all LRA tasks as shown in Tb.3.

Insights on S4 and Variants

We believe our ablations in §3 explain the underlying mechanisms driving the success of deep SSMs. Hence, to conclude the paper, in this section, we inspect in detail the main similarities and differences between our LRU model and diagonal SSMs, and elaborate a few insights. As in §2, to avoid technicalities, we provide a simplified discussion capturing the main features of models stemming from the original S4 paper. For a comparison of different models, we defer the reader to §B.

While Gu et al. (2022a); Gupta et al. (2022b); Smith et al. (2022) also discuss initializations for AA deviating from the HiPPO structure (see §2 and §B), to the best of our knowledge we are the first to show that simple uniform initialization on a slice of the unit disk, combined with proper normalization, is able to also solve the hardest task in LRA: PathX.Among the models in (Gu et al., 2022a), only S4D-inv and S4D-LegS (options heavily inspired by the HiPPO theory) perform beyond random guessing on PathX. In S5, the skew-symmetric component of the HiPPO matrix is used for initialization. We also show (Tb.2) that uniform initialization on the disk, which is simply the diagonalized version of Glorot initialization (Thm. 1), is sufficient to achieve performance close to more complex deep state-space models on the remaining LRA tasks. Our results ultimately suggest that HiPPO theory, while fundamental for the development of this field, should not be thought of as the main source of S4 success.

From this discussion, we conclude that the success of (diagonal) state-space models is attributable to the use of linear recurrences and complex diagonal exponential matrices, combined with the normalization and initialization induced by discretization. On the other hand, other artifacts of discretization such as parameter sharing or the continuous-time interpretation do not necessarily contribute to its performance.

Conclusion

In this paper, we introduce a new RNN layer called the Linear Recurrent Unit or LRU and show how it can be effectively and efficiently used as core layers of deep sequence models. We provide theoretical insights and extensive ablations on a series of step-by-step modifications of a vanilla RNN—linearization, diagonalization, stable exponential parameterization and normalization—that substantially improve performance, especially on tasks requiring long range reasoning. While our recurrence shares similarities with modern deep SSMs, our design does not rely on discretization of a latent continous-time system or on structured transition matrices. Instead our improvements directly follow from initialization and forward pass analysis arguments standard in the deep learning community, starting from a Glorot-initialized RNNs. Our final model matches the performance of modern deep state-space models (e.g. S4 or S5) on all LRA tasks.

Acknowledgements

The authors would like to thank Michalis Titsias, Aleksandar Botev, James Martens and Yee Whye Teh for the interesting discussions and perspectives on our work.

References

Supplementary Materials

Appendix A Simplified Implementation of the Linear Recurrent Unit

We present here a simplified JAX implementation (Bradbury et al., 2018) of the Linear Recurrent Unit (LRU). The state of the LRU is driven by the input (uk)k=1L(u_{k})_{k=1}^{L} of sequence length LL according to the following formula (and efficiently parallelized using an associative scan): xk=Λxk1+exp(γlog)(Buk)x_{k}=\Lambda x_{k-1}+\exp(\gamma^{\log})\odot(Bu_{k}), and the output is computed at each timestamp kk as follows: yk=Cxk+Duky_{k}=Cx_{k}+Du_{k}. In our code, B,CB,C follow Glorot initialization, with BB scaled additionally by a factor 2 to account for halving the state variance by taking the real part of the output projection. DD is random HH-dimensional and mutiplies elementwise each uku_{k}, where kk is the timestamp. Λ\Lambda is initialized with the help of Lemma 2, with phase potentially restricted to a thin slice (see §3.4).

Appendix B Related works

We first discuss standard RNN-based approaches for sequence-to-sequence modeling, and then provide a historical overview on the progress of the literature stemming from the S4 paper (Gu et al., 2021a).

Before the rise of transformers (Vaswani et al., 2017), RNNs were widely used in various applications of natural language processing tasks such as language modeling (Mikolov et al., 2010), machine translation (Cho et al., 2014b) and text summarization (Nallapati et al., 2016). The modern RNN structure (see Eq.1) is mainly attributed to the works of Rumelhart et al. (1985). However, it is possible to see the Hopfield Networks as a particular form of RNN (Hopfield, 1982). Modern RNN formulations are also often related to the Elman Networks (Elman, 1990). The issue of vanishing or exploding gradients, as described by Bengio et al. (1994); Pascanu et al. (2013), is one barrier to training Recurrent Neural Networks (RNNs) with gradient descent. This problem limits the ability of RNNs to learn, especially on tasks with long input sequences. One of the critical contributions to the success of RNNs was the introduction of gating mechanisms such as the Long Short-Term Memory (LSTM) proposed by the Hochreiter and Schmidhuber (1997). LSTMs address the vanishing gradients problem by introducing input, output, and forget gates, which enable the network to selectively remember or forget information from previous time steps. Another popular variant of gated RNNs is the Gated Recurrent Unit (GRU) (Cho et al., 2014b) which simplifies the LSTM architecture by merging input and forget gates into a single update gate.

Recently, Arjovsky et al. (2016) introduced unitary evolution RNNs (uRNN), where eigenvalues in the RNN transition matrix (see Eq. (1)) are restricted to live on the unit circle. The induced map driving the hidden state evolution, therefore, mixes state components taking into account new inputs — but the signal from past timestamps is not exponentially vanishing/exploding as in the vanilla RNN case (see discussion on stability in §3.2.1). This idea is powerful but introduces two problems: (1) choosing unitary transitions restricts the function approximation class, and (2) training unitary matrices is expensive since a projection on the Stiefel manifold is required at each gradient step. To resolve the second issue, many works devoted attention to carefully designed reparameterization of the transition matrix as e.g., with the product of simpler matrices (Arjovsky et al., 2016), Givens rotations (Jing et al., 2017), Householder reflections (Mhammedi et al., 2017), or as exponentials of skew-symmetric matrices (Hyland and Rätsch, 2017; Lezcano-Casado and Martınez-Rubio, 2019). The approximation capacity of these models is discussed and improved in (Wisdom et al., 2016). A further step in designing efficient orthogonal RNNs is provided by Helfrich et al. (2018), who parametrized skew-symmetric matrix using the Cayley transforms, resulting in a fully real parameter space. Other works which proposed conceptually different solutions to mitigate the vanishing gradient problem include combinations with rectified linear units (Le et al., 2015), Lipschitz RNNs (Erichson et al., 2021), and approaches based on dilated convolutions to increase context size (Oord et al., 2016; Bai et al., 2018)

Later, Gu et al. (2021a) scaled up this idea into a deep architecture, where a collection (one for each input dimension) of discretized continuous-time structured SSM was placed at each layer as a substituteThis idea is also leveraged in FNet (Lee-Thorp et al., 2021), where the attention mechanism is replaced with a simpler linear token-mixing strategy. for the attention block, in an attempt to mitigate the O(L2)O(L^{2}) issue in transformers and provide a theoretically principled component for sequence-to-sequence modeling. The model reached state-of-the-art on the Long Range Arena benchmark (Tay et al., 2020), effectively showcasing the power of discretized linear recurrences using structured transition matrices. Notably, the resulting model, named S4, uses a convenient and stable representation of the HiPPO transition, which is initialized using a normal + low-rank matrix and then learned efficiently in diagonal + low-rank form using fast Fourier transforms (FFTs) and Cauchy kernels.

In the months following the publication of S4, Gupta et al. (2022a) noticed that most of S4 performance can be retrieved by only considering the diagonal component of the HiPPO matrix, and therefore showed the power of discretized diagonal structured continuous-time state space models. This architecture is known as DSS. As the interest of the community was rising, with first applications of DSS and S4 in language (Mehta et al., 2022), vision (Nguyen et al., 2022) and audio (Goel et al., 2022), Gu et al. (2022a) further simplified DSS providing a diagonal form (S4D) with theoretical guarantees in the infinite width setting. Notably Gu et al. (2022a) showed that, to retrieve most performance of S4, one can simply initialize the transition matrix AA in diagonal form, with entries an=12+iπna_{n}=-\frac{1}{2}+i\pi n (S4D-Lin) or an=12+iNπ(Nn+11)a_{n}=-\frac{1}{2}+i\frac{N}{\pi}\left(\frac{N}{n+1}-1\right) (S4D-Inv). Our interest in S4-like models spiked at this point since the findings of Gu et al. (2022a) suggest that, given the effectiveness of such simplified versions of AA, the root of S4 success might be attributable to more fundamental effects are orthogonal to the HiPPO theory.

Shortly after, Smith et al. (2022) found that one can also depart from the formal one-dimensional discretization structure of S4, rooted in the HiPPO theory, and considered a simplified version where all input dimensions are efficiently and simultaneously processed using parallel scans (Martin and Cundy, 2017) — not separately like in S4, S4D, and DSS. This model (named S5) set a new state-of-the art on PathX, the hardest task in the Long Range Arena, and provides further evidence for a conceptually simpler motivation for the performance of deep state-space models. Indeed, as already mentioned, S5 is not precisely the discretization of a latent continuous-time SSM, yet still includes parameters like discretization stepsizes that have an ambiguous interpretation in this contextOne can still view S5 as a discretized version of a continuous-time SSM. However, this requires adjusting the input projection matrix., suggesting further investigations are needed.

At the same time, a few interesting works developed novel variants of the S4 architecture. Liquid S4 used the original (non-diagonal) S4 formulation combined with liquid time-constant networks (Hasani et al., 2021, 2022). Similar to DSS, S4D, and S5, Mega also simplified S4 to a diagonal SSM (Ma et al., 2022) while showing additionally that restricting the diagonal AA to real numbers – giving it an exponential moving average (EMA) interpretation – can still work well when combined with attention and a gated block design. Another intriguing view was provided by the SGConv model (Li et al., 2022a), which leverages the convolutional interpretation of SSMs (Gu et al., 2021b) to design a purely filter-based version of S4, with no latent continuous-time model or need for discretization.

The discretization viewpoint also attracted the interest of Gupta et al. (2022b), concurrent to this work, who pointed out that, after numerical integration, diagonal state-space models and linear RNNs share the same function approximation class. Gupta et al. (2022b) then introduced DLR, most closely related to DSS and S4D (each input is processed independently at each layer) but where the discretization stepsize Δ\Delta is absorbed into the continuous-time transition matrix AA (see §2). Their focus was on a new set of synthetic long-range tasks with strong supervision (e.g. segmentation), while ours is on the established Long Range Arena benchmark.

To conclude, we point the reader to interesting recent applications of models inspired by the S4 architecture. In addition to earlier applications in NLP (Mehta et al., 2022), more sophisticated architectures based on S4 recently showed great promise in language modeling (Dao et al., 2022b; Ma et al., 2022). Specifically, Dao et al. (2022b) designed a new generative language model, H3, that outperforms GPT-Neo-2.7B with SSMs, augmented with two attention layers. Besides language, deep state-space models were also found successful for long video/audio understanding and generation tasks (Islam and Bertasius, 2022; Nguyen et al., 2022; Goel et al., 2022), and have attracted interest in biology (Bordin et al., 2022) and time series forecasting (Zhou et al., 2022).

Appendix C Additional experimental results

In Tb.4, we show training speed comparisons of the LRU with a regular RNN with tanh activations, as well as with the S4D and S5 models. As we elaborate in §2.2, for the LRU, we closely followed the optimal model sizes of the S5 model. Consequently, we also see similar training speeds as the S5 model on all tasks.

C.2 Effect of stability and normalization

In this section, we explore further the effect of introducing stability during training (§3.3), as well as introducing the γ\gamma normalization factor as shown in Eq.(7). To do this, we consider the sCIFAR experiment where we sweep over different settings of rmaxr_{\max} and rminr_{\min} to see the effect when initializing closer to the unit disk. We keep the learning rate fixed at 0.004 for these experiments, which we found to be optimal when initializing with rmax=1.0r_{\max}=1.0 and rmin=0.0r_{\min}=0.0 under a stable exponential parameterization.

We show our results in Tb.5. In the first table Tb.5(A), we show results with our baseline where we use the exponential parameterization described in §3.3. We see that under this setting, the optimal performance is achieved when rmax=rmin0.9r_{\max}=r_{\min}-0.9, and performance degrades as rmaxr_{\max} is increased beyond 0.9.

In Tb.5(B) we show results after enforcing stability. We now notice that for each rminr_{\min}, the optimal performance is achieved by a higher rmaxr_{\max} than before, i.e., training is more when initializing closer to the unit disk. Our optimal performance in this setting is achieved using rmin=0.0r_{\min}=0.0 and rmax=0.99r_{\max}=0.99. Note that even in this setting, performance can sometimes degrade when moving to even higher rmaxr_{\max}.

Finally, in Tb.5(C) we also incorporate the γ\gamma normalization factor, and we now notice no degradation in performance even when rmax=0.999r_{\max}=0.999. We found training to be more stable in this setting, and our best result of 89.0% performance is also obtained in this setting, with rmin=0.9r_{\min}=0.9 and rmax=0.999r_{\max}=0.999.

These ablations further motivate the benefits of enforcing stability and using the normalization parameter for better performance and more stable training, particularly when required to learn very long-range dependencies.

C.3 Expanded tables

Below we show our full results on the Long Range Arena, expanding on Tables 1, 2, and 3 in the main paper. The tables are presented in logical order: in Table 6, we show that vanilla (dense) RNNs profit from dropping recurrent nonlinearities when used in the context of the architecture in Fig. 1. Next, in Table 7 we diagonalize our linear RNN model from §3.1 and show how different parametrization for the diagonal elements affect performance. For all the rows in Table 7, initialization of the diagonal RNN was performed uniform on the disk, to match the random Glorot initialization of our dense version (Thm. 1).

Further, the last row in Table 7 shows the positive effects of changing initialization distribution to a thin ring close to the circle boundary — effectively enabling long-range reasoning through mitigation of vanishing gradients. Our settings for the ring are reported on the first row of Table 8. Finally, the second row of this table shows the improvements that can be achieved by including model normalization (Eq. (7)), which closes the accuracy gap with deep SSMs.

Appendix D Detailed experimental setup

In this section, we describe our experimental details.

We consider the standard S4 architecture of Gu et al. (2021a) and replace the S4 layers with RNN layers or with S5 (Smith et al., 2022) or S4D (Gu et al., 2022a) layers for our baselines. We give an overview of the architecture used in Fig.1. The input is first encoded into HH features, followed by a stack of residual blocks. For all our experiments, we use networks with a depth of 6 residual blocks. Each residual block consists of identity skip connection, and the residual path containing a normalization layer (in our case, we always use batch normalization in our experiments), followed by the RNN/SSM block. While using the “post-norm” option of adding the normalization layer after the skip and residual branches typically improves performance, we stick to this design due to this architecture being more scalable in general (De and Smith, 2020).

Each RNN/SSM block first contains the recurrent layer as described in Eqs.(1) and (3) in §2. This is followed by a mixing layer. For all experiments except PathX, we use the GLU activation function (Dauphin et al., 2017) with dropout as the mixing layer, similar to Gu et al. (2021a). For PathX, we instead use a GLU activation function without one additional linear transform; the same as used by Smith et al. (2022) for their experiments.

We use bidirectional models for our experiments on PathFinder and PathX, using a similar setup as Gu et al. (2021a), and use unidirectional models for the rest of our experiments.

D.2 General experimental details

We use AdamW as our optimizer (Loshchilov and Hutter, 2017). We use warmup for the learning rate, where we start from a value of 10710^{-7} and increase the learning rate linearly up a specified value for the first 10% of training. This is followed by cosine annealing for the rest of training down to a value of 10710^{-7}.

We used a smaller learning rate for the RNN/SSM parameters AA and BB. When using normalization in our RNNs, we also used a smaller learning rate on the normalization parameter γ\gamma. For our S5 and S4D baselines, we used a smaller learning rate for the discretization step size Δ\Delta. This smaller learning rate was determined by multiplying the base learning rate by a factor <1<1 (See Tb.9 for the learning rate factor used for each task).

We use weight decay for all parameters except the RNN/SSM parameters AA and BB (and γ\gamma and Δ\Delta when applicable).

All experiments were carried out on accelerated hardware A100 GPUs.

D.3 Hyperparameters

We closely followed the hyperparameter settings of the S5 model Smith et al. (2022) for all our experiments, with minimal additional tuning. For our S5 baseline, we tuned the model dimension HH and state dimension NN, and used the optimal values for the LRU model as well. For the S4D baseline, we also tuned HH and NN. For all our experiments, we tuned the base learning rate on a logarithmic grid of 2 to choose the optimal learning rate. We present the hyperparameters we used for each LRU experiment in Tb.9.

D.4 Tasks

We use the 6 tasks in the Long Range Arena benchmark for our experiments (Tay et al., 2020), with the only difference being we use colored sCIFAR images instead of the grayscale sCIFAR images used in LRA.

Appendix E Theoretical insights

We provide here theoretical groundings for some observations made in §3. We start by showing in §E.1 that, when interleaved with MLP blocks, stacked linear RNNs can model highly nonlinear dynamical systems. We provide two separate views that justify our findings: in §E.1.1, we provide a spectral explanation, while in §E.1.2 we present a function-space prespective. Our results, combined with the observation that nonlinear RNNs are difficult to optimize (§E.2), provide a justification for the results in Tb. 1. Next, motivated by the results in Tb. 3 we in discuss in the same section optimization of linear RNN blocks, and show that exponential reparameterization can accelerate training.

In our sequence-to-sequence setting, it is a natural to seek models which (at least in the width limit) are able to map inputs uu to outputs yy (last layer) using a flexible nonlinear transition map TT learned from data. Mathematically, a fully-expressive causal model should be able to approximate yk=T(uk,uk1,,u1)y_{k}=T(u_{k},u_{k-1},\dots,u_{1}), where TT is an arbitrary nonlinear map.

We show in this section how interleaving linear RNNs with MLPs in a deep architecture provides a flexible and modular recipe for the approximation of nonlinear transition maps.

In our architecture (Fig.1) an activation function, as well as a linear position-wise layer, is placed right after each RNN output. As can be seen in Fig. 6, this operation causes spectral leakage: information gets copied over different frequency components.

The behavior shown in Fig. 6 can be characterized exactly:

where F\mathcal{F} denotes the Fourier transform, \star the convolution operation and sinc(x):=sin(x)/x\text{sinc}(x):=\sin(x)/x.

This result is simple to parse: the Fourier transform of a ReLU activated signal is equal to the Fourier transform before the ReLU, convolved with a kernel which transports information to higher frequencies — an operation which is impossible for linear RNNs, even as the width increases. As such, introducing an MLP completes the list of requirements for approximations of a nonlinear transition map: frequencies can be scaled up and down arbitrarily by the RNN, and can then be translated in the space using the ReLU. As depth increases, these operations can be combined in a modular fashion, leading to highly nonlinear dynamics using easy-to-learn linear blocks, interleaved with simple activations.

To conclude, we provide a proof for the proposition above.

Recall that multiplications in the time domain are convolutions in the frequency domain.

Let now u1=uu_{1}=u and u2=χ(u1>0)u_{2}=\chi(u_{1}>0), then u1u2=ReLU(u)u_{1}\cdot u_{2}=\text{ReLU}(u). Next, let PiP_{i} be the ii-th region activated by the ReLU, and let us write Pi=[piLi,pi+Li]P_{i}=[p_{i}-L_{i},p_{i}+L_{i}]. We can write χ(u1>0)=iχ[piLi,pi+Li]\chi(u_{1}>0)=\sum_{i}\chi_{[p_{i}-L_{i},p_{i}+L_{i}]}.

Recall now the following basic properties:

Fx(tt0)(ω)=eiωt0Fx(t)(ω)\mathcal{F}_{x(t-t_{0})}(\omega)=e^{-i\omega t_{0}}\mathcal{F}_{x(t)}(\omega).

The Fourier transform of a rectangular pulse between τ-\tau and τ\tau is 2τsinc(ωτ)2\tau\cdot\text{sinc}(\omega\tau), where sinc(x)=sin(x)/x\text{sinc}(x)=\sin(x)/x.

E.1.2 Insights from Koopman operator theory

We show how Koopman operator theory (Koopman and Neumann, 1932), combined with recent advances in dynamic mode decomposition (Schmid, 2010; Kutz et al., 2016; Williams et al., 2015), can provide a solid theoretical foundation for understanding the class of functions that can be approximated by linear RNNs, interleaved with MLPs. Our notation and results are based on Korda and Mezić (2018); Mauroy et al. (2020).

For instance, let us consider n=1n=1 and the observable f(x)=sin(x)f(x)=\sin(x): the Koopman operator is the map that takes sin()KSsin(S())\sin(\cdot)\overset{\mathcal{K}_{S}}{\mapsto}\sin(S(\cdot)), i.e. advances the measurement ff one step forward in time. The crucial property of the Koopman operator is that it is linear and bounded (Mauroy et al., 2020): let f1,f2f_{1},f_{2} be two observables, then

In essence, Koopman operator theory, provides the following guarantee: any sufficiently regular nonlinear autonomous dynamical system can be made linear under a high-dimensional nonlinear blow-up of the state-space. Sounds familiar? This is exactly what a wide MLP + Linear RNN can do. Moreover, to take the system back to the original coordinate system, one just needs a linear projection with matrix VV. In practice, for identification and diagnosis of nonlinear systems (e.g. in machanical engineering), this approach is used in a truncated version, where the finite class of dominant eigenfunctions is constructed by using the dynamic mode decomposition (DMD) algorithm from Hermite Polynomials (Schmid, 2010; Kaiser et al., 2021).

This operator is again linear and bounded for regular enough SS (Korda and Mezić, 2020) — hence the analysis in the autonomous setting carries out also in this case. In particular, using the notation in the last paragraph:

In essence, Koopman operator theory, provides the following guarantee: any regular nonlinear dynamical system is representable by a linear RNN after proper nonlinear reparameterization of the inputs — which can be performed by an MLP. While we believe this connection is conceptually solid and gives substantial insights into our architecture, a quantitative discussion would require substantial technical efforts perhaps linked to recent contributions from the statistical learning community (Kostic et al., 2022).

E.2 Optimization of recurrent blocks

In this subsection we back-up some of our claims about optimization of linear RNNs with experimental findings on toy examples. Our purpose is to confirm validity of our intuition outside the deep learning setting, without architecture-dependent confounders: i.e on vanilla RNNs with one layer.

In §3 and §E.1 we showed how linear RNNs can be used as elementary recurrent blocks for the purpose of modeling complex nonlinear dynamics when stacked in deep architectures. Similarly, the results in (Li et al., 2022a) indicate that, to achieve S4 performance, one can equivalently replace the recurrent core with a collection of convolutions parametrized by filters. While a single-layer level, a (dense) RNNs (Eq.1) with tanh\tanh or sigmoid activation can express convolutions with filters (Wang et al., 2022), the results in Tb. 1 (and Fig. 1(a) in Wang et al. (2022)) indicate an advantage on test accuracy from dropping such nonlinearities in the recurrence — i.e. of making the RNN linear. Motivated by this, in Fig. 7 we consider the problem of learning a single one-dimensional convolution kernel with a single layer RNN, and compare performance of linear and tanh\tanh activations. The sequence length in this problem was 100100, and our data consists in 32 input-output one-dimensional trajectories, where the output is the result of a convolution with the kernel of elements hk:=110exp(0.015k)cos(0.04k)2h_{k}:=\frac{1}{10}\exp(-0.015\cdot k)\cos(0.04\cdot k)^{2}, which induces moderate-length dependencies in the data (see bump in the kernel in Figure 7 at k=70k=70). The 32 input sequences are generated sampling random a,ca,c parameters on a range and have form sin(0.05ak)cos(0.05ck)2\sin(0.05\cdot a\cdot k)\cos(0.05\cdot c\cdot k)^{2}. Outputs are generated by convolving each input by hh. Learning is performed using the Adam optimizer (Kingma and Ba, 2014) with standard momentum parameters.

Interestingly, already on this simple task, linear RNNs outperforms the tanh\tanh variant even after careful tuning of the stepsize. While the input-output map the system had to approximate is linear (i.e. a convolution), this result still indicates that on deep architectures, where the MLPs interleaving RNNs can quickly perform position-wise nonlinearities lifting the function approximation class (see §E.1), linear RNNs are preferrable.

E.3 On alternatives to using complex numbers

In this subsection, we show how to derive the canonical real form for a non-symmetric real-valued matrix AA, which we assume to be diagonalizable in the complex domain (always true up to arbitrary small perturbation of the entries (Axler, 1997)). This derivation is classical and can be found in many textbooks under the context of real Jordan form (more general), see e.g. Weintraub (2009). Here, we present a simplified discussion.

The action of AA on its real eigenvectors (with real eigenvalues) is trivial and analogous to the symmetric case — this corresponds to a diagonal entry in the diagonalized version of AA. For the subspaces spanned by complex eigenvalues, the discussion is more interesting: let λ,λ\lambda,\lambda^{*} be a pair of conjugate eigenvalues with corresponding eigenvectors v,vv,v^{*}. Collect v,vv,v^{*} in a N×2N\times 2 matrix VV, then

The careful reader might recognize that, in the resulting system, matrix multiplication for the 2×22\times 2 blocks is algebraically equivalent to multiplication of the corresponding complex numbers. Hence, while complex numbers are not per-se needed to find a simple representation of non-symmetric matrices, they are convenient to work with since the matrix in Eq. (26) is structured: has 4 entries but can be represented using just two — real and imaginary parts, exactly what a complex number stores in memory.

Appendix F Proofs

In this section we provide proofs for the propositions listed in the main paper.

We provide here a proof for the following sampling lemma. See 2

First, note that one can sample phase and magnitude independently by symmetry of the target distribution. Phase sampling can trivially performed through scaling a uniform distribution.

Next, we consider sampling the magnitude. The area of the ring in between rminr_{\min} and rmaxr_{\max} is π(rmax2rmin2)\pi(r_{\max}^{2}-r_{\min}^{2}), while the cumulative distribution function for the radius distribution is such that Fr(rmin)=0F_{r}(r_{\min})=0, Fr(rmax)=1F_{r}(r_{\max})=1 and for r[rmin,rmax]r\in[r_{\min},r_{\max}] we therefore have

Under parametrization of rr using the exponential, r=eνr=e^{-\nu}, one gets

Finally, we use the inverse sampling theorem (see e.g. Vogel (2002)): one can sample ν\nu using the formula ν=F1(u)\nu=F^{-1}(u), where uu is uniform on $$. By setting

from which follows that ν=12log((rmax2rmin2)u+rmin2)\nu=-\frac{1}{2}\log((r_{\max}^{2}-r_{\min}^{2})u+r_{\min}^{2}).

F.2 Proof of Proposition 3

Validity of this proposition is verified numerically in Figure 9.

Note that Λ=diag(λ1,,λN)\Lambda=\text{diag}(\lambda_{1},\dots,\lambda_{N}) is diagonal with equally distributed entries on the disk between radii rminr_{\min} and rmaxr_{\max}. One can then sample a generic entry λ\lambda using the change of variables formula for probabilities (Jeffreys, 1998) as follows (see also Lemma 2):

The expectation w.r.t θ\theta is non-zero only if n=mn=m, therefore