Recurrent neural networks: vanishing and exploding gradients are not the end of the story

Nicolas Zucchet, Antonio Orvieto

Vanishing and exploding gradients

Let us first introduce the notations we will be using throughout the rest of the paper. We consider a recurrent neural network with hidden state $h_{t}$ , update function $f_{\theta}$ parametrized by $\theta$ , and input sequence $(x_{t})_{t}$ . The average performance of the network is measured by a loss $L$ . We have

The gradient of the instantaneous loss $L_{t}$ with respect to the parameters $\theta$ is then equal to

Early work highlighted the difficulty for gradient descent to make recurrent neural networks remember past inputs that will later be useful to produce a desired behavior. This is due to the fact that error signals flowing backward in time tend to either explode or vanish. The key quantity is

Since then, the analysis has been refined and the development of recurrent architectures has mostly been driven by the desire to solve this pathological issue. Most famously, the LSTM unit, and later the GRU, solve this problem by using memory neurons that facilitate direct information storage and retrieval, and by the same token error backpropagation. Other approaches to solving this problem, to name a few, involve gradient clipping, activity normalization, careful weight initialization, or enforcing architectural constraints such as hierarchical processing, orthogonal weight matrices, and oscillations.

The curse of memory

Common deep learning wisdom holds that solving the vanishing and exploding gradients problem enables recurrent neural networks to learn long-term dependencies. We challenge this view and ask: is solving those issues really enough to ensure well-behaved loss landscapes? We answer negatively by showing that gradients can explode as the memory of the network increases, even when the dynamics of the network remain stable.

Recurrent neural networks have something special: the very same update function $f_{\theta}$ is applied over and over. Therefore, modifying the parameters $\theta$ will not only influence one update, as changing the weights of a given layer in a feedforward neural network would, but all of them. As the memory of the network increases, the hidden states keep a trace of the effect of more updates. Hidden states thus become increasingly sensitive to parameter changes. This is the curse of memory. We borrow the term from , and note that Martens and Sutskever hypothesized that such a phenomenon could arise in RNNs and hinder their optimization.

Let us formalize our intuition and consider the sensitivity of the hidden state $h_{t}$ on the parameters $\theta$ :

2.2 Signal propagation in linear diagonal recurrent neural networks

We study how hidden state and gradient magnitudes evolve as the network encodes longer-term dependencies. Ideally, we would like these quantities not to vanish or explode. This property improves the conditioning of the loss landscape and hence eases optimization . We make the following assumptions:

Linear diagonal recurrent neural networks. We restrict ourselves to update functions of the form $f_{\theta}(h_{t},x_{t+1})=\lambda\odot h_{t}+x_{t+1}$ with $\lambda$ a vector of the size of $h_{t}$ and $\odot$ the element-wise product. For ease of exposition, we present results for real-valued $\lambda$ here; see Appendix A.2 for the complex-valued setting. While this assumption is strong, it allows us to identify some crucial mechanisms and it is satisfied for some models like S4 and LRUs . We later show our analysis can model some features of more sophisticated networks.
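To make the update rule concrete, the following Python sketch iterates $h_{t+1}=\lambda\odot h_{t}+x_{t+1}$ for a small hidden state. It is only an illustration under our own assumptions: the function name `diagonal_rnn`, the zero initial state and the toy inputs are hypothetical and do not come from the experiments.

```python
def diagonal_rnn(lam, xs, h0=None):
    """Iterate h_{t+1} = lam * h_t + x_{t+1} element-wise; return all hidden states."""
    h = [0.0] * len(lam) if h0 is None else list(h0)
    states = []
    for x in xs:
        # Element-wise product: each hidden unit only interacts with its own past.
        h = [l * h_i + x_i for l, h_i, x_i in zip(lam, h, x)]
        states.append(h)
    return states


lam = [0.5, 0.9]                      # one decay rate per hidden unit (toy values)
xs = [[1.0, 1.0] for _ in range(3)]   # three time steps of constant input
states = diagonal_rnn(lam, xs)
print(states[-1])                     # the unit with larger lam accumulates more input
```

Because the recurrence is diagonal, each coordinate can be analyzed separately, which is why the one-dimensional case suffices in what follows.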

Infinite time horizon. We consider infinite sequences and initialize the network dynamics at $t_{0}=-\infty$ . It simplifies our calculations while being a reasonable assumption when the sequences considered are longer than the characteristic timescales of the dependencies we want to learn.

We are now equipped to analyze signal propagation in one recurrent layer, both in the forward and backward passes. We show that both hidden states and backpropagated errors explode as $|\lambda|\rightarrow 1$ .

Importantly, the variance of the hidden state goes to infinity as longer-term dependencies are encoded within the network, that is $|\lambda|\rightarrow 1$ . Additionally, the divergence speed depends on the input data distribution: it increases as consecutive time steps in the input distribution become more correlated (i.e., fewer of the $R_{x}(\Delta)$ terms are negligible). This behavior already highlights potential difficulties of gradient-based learning of deep neural networks containing linear recurrent layers as the variance of neural activity can become arbitrarily large, hindering learning abilities of deeper layers.
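The growth of the hidden-state variance can be checked numerically. The sketch below assumes a real-valued $\lambda$, unrolls the recurrence as $h_{t}=\sum_{n\geq 0}\lambda^{n}x_{t-n}$, and takes the auto-correlation $R_{x}(\Delta)=\rho^{|\Delta|}$ with unit input variance. The closed form $(1+\rho\lambda)/((1-\rho\lambda)(1-\lambda^{2}))$ used for comparison is our own evaluation of the resulting geometric sums under these assumptions, and the function names are hypothetical.

```python
def variance_direct(lam, rho, n_terms=300):
    """Truncated double sum  sum_{n,m} lam^(n+m) * rho^|n-m|  approximating E[h_t^2]."""
    total = 0.0
    for n in range(n_terms):
        for m in range(n_terms):
            total += lam ** (n + m) * rho ** abs(n - m)
    return total


def variance_closed(lam, rho):
    """Closed form of the same sum for |lam| < 1 and 0 <= rho <= 1."""
    return (1 + rho * lam) / ((1 - rho * lam) * (1 - lam ** 2))


# The truncated sum and the closed form agree for a moderate lam.
print(variance_direct(0.8, 0.5), variance_closed(0.8, 0.5))

# The variance grows as lam -> 1, and faster for more correlated inputs.
for lam in (0.5, 0.9, 0.99):
    print(lam, variance_closed(lam, 0.0), variance_closed(lam, 0.9))
```

The direct double sum is only evaluated for a moderate $\lambda$ because the number of terms needed for convergence grows quickly as $\lambda\rightarrow 1$.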

We plot the exact behavior of this quantity when the auto-correlation of $x$ satisfies $R_{x}(\Delta)=\rho^{|\Delta|}$ in Figure 1 and refer the interested reader to the appendix for a derivation of Equation 6. Framed more generally, the hidden state of the network, and thus its final output, becomes increasingly sensitive to changes in recurrent parameters as the network reaches the edge of dynamical stability ( $|\lambda|\rightarrow 1$ ).
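As a rough illustration of this sensitivity, consider the special case of i.i.d. inputs with unit variance ( $\rho=0$ ) and real $\lambda$ . Unrolling the recurrence gives $\mathrm{d}_{\lambda}h_{t}=\sum_{n\geq 1}n\lambda^{n-1}x_{t-n}$ , whose variance is $\sum_{n\geq 1}n^{2}\lambda^{2(n-1)}=(1+\lambda^{2})/(1-\lambda^{2})^{3}$ . This special case is our own worked example and the function names below are hypothetical; the sketch compares the truncated series with the closed form and with the hidden-state variance $1/(1-\lambda^{2})$ .

```python
def sensitivity_variance_series(lam, n_terms=20000):
    """Truncated series  sum_{n>=1} n^2 * lam^(2(n-1))  for i.i.d. unit-variance inputs."""
    return sum(n * n * lam ** (2 * (n - 1)) for n in range(1, n_terms))


def sensitivity_variance_closed(lam):
    """Closed form of the series above, valid for |lam| < 1."""
    return (1 + lam ** 2) / (1 - lam ** 2) ** 3


def hidden_variance(lam):
    """Variance of h_t for i.i.d. unit-variance inputs."""
    return 1.0 / (1 - lam ** 2)


for lam in (0.5, 0.9, 0.99):
    print(lam,
          hidden_variance(lam),
          sensitivity_variance_closed(lam),
          sensitivity_variance_series(lam))
```

In this special case the sensitivity scales as $(1-\lambda^{2})^{-3}$ whereas the hidden state only scales as $(1-\lambda^{2})^{-1}$ , so the recurrent parameter is affected far more strongly than the activity itself.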

The last quantity that we need to consider is the error that is backpropagated to the inputs $x$ of the recurrent layer. It can be observed that the backward pass is dual to the forward pass in the sense that it is a recurrent process that receives backpropagated errors $\partial_{h_{t}}L$ and it runs in reverse time:

in which we made use of $\partial_{x_{t}}h_{t}=1$ . It follows that the analysis we did for the forward pass also holds here. Crucially, this implies that the explosion behavior will be most significant for the recurrent parameters rather than for potential input or readout weights.
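The duality between the two passes can be sketched in a few lines of Python. We assume a scalar network, a zero initial state, and the toy per-step loss $L_{t}=\frac{1}{2}h_{t}^{2}$ ; all of these, as well as the function names, are illustrative choices of ours. The backward recursion accumulates $\delta_{t}=\lambda\delta_{t+1}+\partial_{h_{t}}L_{t}$ in reverse time, and the result can be compared against finite differences.

```python
def forward(lam, xs):
    """Run h_{t+1} = lam * h_t + x_{t+1} from h = 0 and return all hidden states."""
    h, hs = 0.0, []
    for x in xs:
        h = lam * h + x
        hs.append(h)
    return hs


def loss(lam, xs):
    """Toy loss: sum over time of 0.5 * h_t^2."""
    return sum(0.5 * h * h for h in forward(lam, xs))


def backward(lam, xs):
    """Reverse-time recursion delta_t = lam * delta_{t+1} + dL_t/dh_t."""
    hs = forward(lam, xs)
    delta, grad_lam = 0.0, 0.0
    grads_x = [0.0] * len(xs)
    for t in reversed(range(len(xs))):
        delta = lam * delta + hs[t]          # dL_t/dh_t = h_t for the toy loss
        grads_x[t] = delta                   # dh_t/dx_t = 1
        h_prev = hs[t - 1] if t > 0 else 0.0
        grad_lam += delta * h_prev           # direct dependence of h_t on lam
    return grads_x, grad_lam


xs = [1.0, -0.5, 0.3, 2.0]
grads_x, grad_lam = backward(0.9, xs)
eps = 1e-6
print(grad_lam, (loss(0.9 + eps, xs) - loss(0.9 - eps, xs)) / (2 * eps))
```

The backward loop has the same structure as the forward one, with $\lambda$ multiplying the running error instead of the running state.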

2.3 Extending the analysis to the non-diagonal case

Mitigating the curse of memory

We have discussed the sensitivity of recurrent networks to parameter updates. Given this problem, how can it be mitigated? Recurrent networks with diagonal connectivity are particularly well suited for this purpose. Besides enabling control over the Jacobian and avoiding exploding gradients, they facilitate the mitigation of the curse of memory. In this context, we demonstrate that state-space models and gated RNNs inherently incorporate such mechanisms.

3.2 Several RNN architectures implicitly alleviate the curse of memory

State-space models, as well as gated RNNs, feature some form of normalization and reparametrization which facilitates signal propagation. We discuss how below.

While the original motivation behind gated RNNs such as LSTMs or GRUs largely differs from the one of SSMs, they share similar mechanisms. In these networks, the memory content stored in hidden neurons can be erased through a forget gate, and incoming inputs can selectively be written in memory through an input gate. Mathematically, this corresponds to hidden state updates of the form $h_{t+1}=f_{t+1}\odot h_{t}+i_{t+1}\odot x_{t+1}$ , with the forget $f_{t+1}$ and input $i_{t+1}$ gates being independent non-linear functions of $x_{t+1}$ and $h_{t}$ . The forget gate is akin to $\lambda$ and usually involves a sigmoid non-linearity, which has a similar effect in the backward pass as reparametrizing $\lambda$ . The input gate can act as an input normalization depending on the initialization of the network or if it is coupled to the forget gate as in the GRU ( $f_{t}=1-i_{t}$ ). Importantly, the gates here depend on the hidden states and thus make the Jacobian $\partial_{h_{t}}h_{t+1}$ non-diagonal. Yet, we argue that these architectures still have a bias towards diagonality. Indeed, the contributions of the hidden state through the forget and input gates are indirect, and they can be ignored when the weights connecting the hidden states to the gates are small. We therefore get back to the setting we discussed in the previous paragraph; we confirm this intuition in Section 5. In regimes in which this approximation does not hold, studying signal propagation requires a much more sophisticated analysis than the one we have done here .
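The normalizing role of the input gate can be illustrated with a deliberately simplified calculation. We assume constant scalar gates (ignoring their dependence on $x_{t+1}$ and $h_{t}$ ) and i.i.d. unit-variance inputs, so that the stationary variance of $h_{t+1}=f\,h_{t}+i\,x_{t+1}$ is $i^{2}/(1-f^{2})$ . The function name and the three gate choices below are ours and only serve as a sketch.

```python
import math


def stationary_variance(forget, input_gate):
    """Variance of h for h_{t+1} = forget * h_t + input_gate * x_{t+1}, x i.i.d. unit variance."""
    return input_gate ** 2 / (1 - forget ** 2)


for forget in (0.9, 0.99, 0.999):
    no_gate = stationary_variance(forget, 1.0)                         # no input normalization
    coupled = stationary_variance(forget, 1.0 - forget)                # GRU-style coupling f = 1 - i
    rescaled = stationary_variance(forget, math.sqrt(1 - forget ** 2)) # variance-preserving scaling
    print(forget, no_gate, coupled, rescaled)
```

Without any input gate the variance blows up as the forget gate approaches 1, the coupled gate over-corrects and shrinks the activity, and the square-root scaling keeps the variance at 1 in this idealized setting.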

A linear teacher-student analysis

We consider a teacher-student task with linear recurrent networks . This is arguably the simplest setting in which one can train recurrent networks, and yet, as we shall see, it is remarkably complex. We first turn to the one-dimensional setting to provide an intuitive illustration of how the curse of memory and vanishing gradients interplay. We then address the general setting and observe that linear networks indeed suffer from the curse of memory, and that the remedies we studied in the last section are effective. We additionally find that diagonality greatly modifies the structure of the loss landscape and helps optimizers with adaptive learning rates to compensate for an eventual increased sensitivity.

We first consider a student and teacher following the one-dimensional dynamics $h_{t+1}=\lambda h_{t}+x_{t+1}$ , with complex-valued parameter $\lambda$ for the student and $\lambda^{*}$ for the teacher. For simplicity, we draw $x_{t+1}$ from a normal distribution with mean 0 and standard deviation 1 and note that other input distributions do not qualitatively change the results. The performance of the student is measured by a loss $L$ that averages the per time-step losses $L_{t}:=\frac{1}{2}|h_{t}-h_{t}^{*}|^{2}$ over the entire sequence.

This simple model already captures two key difficulties of gradient-based learning of recurrent neural networks. In Figure 1, we plot the resulting loss landscape for different $\lambda^{*}$ values, when $\lambda$ evolves on the positive part of the real axis (Fig. 1.B) and when it evolves on the circle of radius $|\lambda^{*}|$ in the complex plane (Fig. 1.C). We restrict $\lambda$ s to have absolute values smaller than one: exploding gradients are out of the picture. Still, two difficulties for gradient-based learning appear here. On one side, vanishing gradients lead to flat loss regions that are hard to escape. On the other side, the loss sharpens as the student encodes longer memories because of the curse of memory. As a consequence, gradient-based optimization is extremely tedious, already in this simple example.
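For readers who want to explore this landscape, the loss can be evaluated in closed form under the assumptions stated above (i.i.d. inputs with unit variance, infinite time horizon, $|\lambda|,|\lambda^{*}|<1$ ). Expanding $\frac{1}{2}\mathbb{E}|h_{t}-h_{t}^{*}|^{2}$ with $h_{t}=\sum_{n}\lambda^{n}x_{t-n}$ gives $\frac{1}{2}\left[\frac{1}{1-|\lambda|^{2}}+\frac{1}{1-|\lambda^{*}|^{2}}-2\,\mathrm{Re}\frac{1}{1-\lambda\bar{\lambda}^{*}}\right]$ . This expression is our own expansion and the function names in the sketch are hypothetical.

```python
def teacher_student_loss(lam, lam_star):
    """0.5 * E|h - h*|^2 for i.i.d. unit-variance inputs and an infinite horizon."""
    lam, lam_star = complex(lam), complex(lam_star)
    student = 1.0 / (1.0 - abs(lam) ** 2)
    teacher = 1.0 / (1.0 - abs(lam_star) ** 2)
    cross = (1.0 / (1.0 - lam * lam_star.conjugate())).real
    return 0.5 * (student + teacher - 2.0 * cross)


def curvature_at_optimum(lam_star, eps=1e-4):
    """Finite-difference second derivative along the real axis at lam = lam_star."""
    f = lambda l: teacher_student_loss(l, lam_star)
    return (f(lam_star + eps) - 2 * f(lam_star) + f(lam_star - eps)) / eps ** 2


# The loss sharpens around the optimum as the teacher encodes longer memories.
for lam_star in (0.5, 0.9, 0.99):
    print(lam_star, curvature_at_optimum(lam_star))
```

For real-valued parameters, the curvature at the optimum matches the variance of the sensitivity $\mathrm{d}_{\lambda}h_{t}$ in the i.i.d. case, $(1+\lambda^{2})/(1-\lambda^{2})^{3}$ , which is one way to see the curse of memory directly in the loss.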

4.2 Diagonal connectivity simplifies optimization

We now move to the general case in which the teacher evolves according to

Next, we wonder which design choices behind the LRU architecture are crucial to this performance improvement. To this end, we interpolate between a linear RNN and an LRU in the following way: First, we restrict the weight matrix of the linear RNN to a block diagonal with blocks of size 2. Each such block can represent a complex number, so 32 complex numbers in total. We additionally double the number of hidden neurons. Second, we change those $2\times 2$ blocks (and their input and output weights) to be complex numbers. Finally, we add the $\gamma$ input normalization and the exponential parametrization to obtain the final LRU architecture. We report the results of this experiment in Figure 3.B. We find that most of the gap comes from the introduction of complex numbers and can be partially reduced by making the weight matrix block diagonal. Interestingly, those two changes reduce the number of parameters the model has and slightly reduce the model expressivity, so an explanation of this behavior is likely to be related to the optimization properties of those models. We confirm this hypothesis in the next section.
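The correspondence between $2\times 2$ real blocks and complex numbers used in this interpolation can be made explicit. The sketch below uses the standard fact that multiplication by $a+ib$ acts on $(\mathrm{Re}\,h,\mathrm{Im}\,h)$ as the matrix with rows $(a,-b)$ and $(b,a)$ ; the helper names are hypothetical.

```python
import cmath


def block_from_complex(z):
    """Real 2x2 matrix acting on (Re h, Im h) like multiplication by z."""
    return [[z.real, -z.imag],
            [z.imag, z.real]]


def matvec(matrix, vector):
    """Plain matrix-vector product for small lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


z = 0.9 * cmath.exp(0.3j)      # a complex "eigenvalue" with magnitude 0.9
h = 1.0 + 2.0j                 # a complex hidden state

real_update = matvec(block_from_complex(z), [h.real, h.imag])
complex_update = z * h
print(real_update, (complex_update.real, complex_update.imag))
```

A generic $2\times 2$ block has four free real entries whereas the structured block above only has two, which is consistent with the remark that moving to complex numbers reduces the number of parameters.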

4.3 On the importance of adaptive learning rates

So far, our results highlight the importance of directly parametrizing the complex eigenvalues of the recurrent connectivity matrix. This parametrization does not mitigate any exploding behavior but modifies the loss landscape, making it possible for optimizers with adaptive learning rates to compensate for these behaviors. To demonstrate this, we study the Hessian of the loss:

If the network can perfectly fit the target data, which is the case here, the second term vanishes at optimality. We plot the Hessian at optimality in Figure 4.A and B for a standard linear recurrent network and one with complex diagonal parametrization, both with $4$ hidden neurons ( $\nu=0.99$ ). We observe that the eigenvalue spectra are similar for the two architectures, both exhibiting large terms that are characteristic of the curse of memory, which makes learning with stochastic gradient descent almost impossible. (Footnote: the gradient Lipschitz constant $L$ of the loss equals the maximum Hessian eigenvalue. This quantity sets a bound $2/L$ for the maximum globally stable learning rate. While convergence might happen in a subspace, it is generally aligned with the top Hessian eigenspace near the solution.) However, their structure differs. For the fully connected linear RNN, the top eigenvectors are distributed over many coordinates, whereas they are concentrated on a few coordinates for the complex diagonal one. This feature aids adaptive optimization [e.g., 56]: adapting to large curvature is much easier for Adam when the pathological directions are aligned to the canonical basis. This is what we observe in practice. In Figure 4.C and D, we compare the effective learning rate used by Adam, which we compute by providing a vector of ones to the optimizer. For the dense linear RNN, the adaptive learning rates cannot compensate for the intricate coupling between components, resulting in very small learning rates. Conversely, the sensitivity of complex diagonal RNNs is concentrated on few parameters, which adaptive learning rates can compensate for, leading to targeted and overall larger learning rates, significantly speeding up learning. As a side note, the complex eigenvalues of the teacher come in conjugate pairs. However, during training, the complex values of the complex RNN are not conjugates of each other, thereby increasing Hessian diagonality.
Finally, performing this analysis for the LRU, we find that the Hessian spectrum is similar to the diagonal setting and that the exploding dimensions of the Hessian are almost exclusively due to the angle parameter, consistent with our theoretical analysis; see Figure 9.
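The benefit of having sharp directions aligned with the canonical basis can be illustrated on a two-dimensional quadratic. The sketch below is not Adam: it uses plain diagonal (Jacobi) preconditioning, which divides each gradient coordinate by the corresponding diagonal Hessian entry, as a simplified stand-in for per-coordinate learning rates. The two Hessians share the eigenvalues 1000 and 1 and differ only by a 45 degree rotation; all names and numbers are illustrative assumptions.

```python
def quad_loss(hessian, x):
    """0.5 * x^T H x for small list-based matrices."""
    return 0.5 * sum(x[i] * hessian[i][j] * x[j]
                     for i in range(len(x)) for j in range(len(x)))


def gradient(hessian, x):
    """Gradient H x of the quadratic loss."""
    return [sum(h_ij * x_j for h_ij, x_j in zip(row, x)) for row in hessian]


def diag_preconditioned_descent(hessian, x0, steps):
    """Gradient descent where coordinate i uses step size 1 / H_ii."""
    x = list(x0)
    for _ in range(steps):
        g = gradient(hessian, x)
        x = [x_i - g_i / hessian[i][i] for i, (x_i, g_i) in enumerate(zip(x, g))]
    return x


aligned = [[1000.0, 0.0], [0.0, 1.0]]        # sharp direction along a coordinate axis
rotated = [[500.5, 499.5], [499.5, 500.5]]   # same spectrum, rotated by 45 degrees

x_aligned = diag_preconditioned_descent(aligned, [1.0, 1.0], steps=1)
x_rotated = diag_preconditioned_descent(rotated, [1.0, 0.0], steps=100)
print(quad_loss(aligned, x_aligned))
print(quad_loss(rotated, [1.0, 0.0]), quad_loss(rotated, x_rotated))
```

With the axis-aligned Hessian, per-coordinate step sizes solve the problem in one step. With the rotated one, the same scheme makes little progress even after 100 steps, because no choice of per-coordinate scaling can undo the coupling between coordinates.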

Before concluding this section, we investigate whether there exist eigenvalue distributions that break the diagonal structure of the Hessian, making optimization harder and increasing the pressure on eigenvalue reparametrization. We theoretically prove in Appendix B.2 the intuitive result that the more concentrated the eigenvalues are, the less diagonal the Hessian is. As a consequence, the gap between complex-valued diagonal networks and LRUs widens, but the former still greatly outperform their fully-connected counterpart; see Figure 10.

Signal propagation in deep recurrent networks at initialization

The ultimate goal of our theoretical quest is to gain practical insights into the training of recurrent networks. Specifically, we aim to verify whether the trends established theoretically and in controlled experiments hold in practice, by studying signal propagation at initialization.

The results are consistent with our theory. Complex-valued RNNs suffer from the curse of memory. LRUs almost perfectly mitigate this effect in the forward pass (Fig. 5.A) as well as in the backward pass (Fig. 5.B), except for the angle parameter $\theta$ , as expected. We also wonder whether layer normalization can replace the input normalization and reparametrization of the LRU. We find that it mitigates the memory-induced gradient explosion at the macroscopic level (Fig. 5.C), but it likely kills any learning signal for the smallest eigenvalues. Finally, the LSTM manages to keep the gradient norm constant over different levels of memory, consistent with the intuition we developed in Section 3.2, although the LSTM-specific parameters exhibit smaller gradients than the feedforward parameters.

Conclusion

Vanishing and exploding gradients complicate learning recurrent networks, but solving these problems is not enough. We uncovered yet another difficulty of training such networks, which is rooted in their iterative nature and arises at the edge of dynamical stability. Reparametrizations and adaptive learning rates can effectively mitigate this behavior in practice, and diagonalizing the recurrence simplifies both. Our analysis additionally reveals the complexity of learning the angle of complex eigenvalues, which may explain why complex numbers were not found to be useful in most recent state-space model architectures .

A side finding of our study is the symbiosis between independent modules, which are here neurons and can more generally be small heads, with adaptive learning rate optimizers in linear recurrent networks. Such a design pattern has promising properties: it facilitates online learning and compositional generalization , allows for a high level of parallelization , and matches, at a high level, the modular organization of the cortex in cortical columns . Understanding how to increase the expressivity of small linear modules while keeping their great optimization properties constitutes a promising avenue for future research.

Acknowledgments

The authors thank Robert Meier, João Sacramento, Guillaume Lajoie, Ezekiel Williams, Razvan Pascanu, Imanol Schlag and Bobby He for insightful discussions. Nicolas Zucchet was supported by an ETH Research Grant (ETH-23 21-1) and Antonio Orvieto acknowledges the financial support of the Hector Foundation.

References

Appendix

This section introduces all the theoretical results we directly or indirectly mention in the main text, and provides proofs for them.

Most, if not all, of the calculations in this section involve infinite sums. We state and prove two useful lemmas to simplify later calculations.

The proof naturally comes from separating the indices $n$ and $m$ into three sets: one in which the two are equal, one in which $n$ is larger and one in which $m$ is larger. This gives

In the same conditions as Lemma 1, we have

and using Lemma 1 to get the final result. ∎
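The index-splitting argument sketched in the proof can be checked numerically. The identity tested below, $\sum_{n,m\geq 0}\alpha^{n}\beta^{m}u_{n-m}=\frac{1}{1-\alpha\beta}\left(u_{0}+\sum_{\Delta\geq 1}(\alpha^{\Delta}+\beta^{\Delta})u_{\Delta}\right)$ for $|\alpha|,|\beta|<1$ and a bounded symmetric sequence $u$ , is our reconstruction of what such a splitting yields and should be read as an assumption about the form of the lemma rather than a verbatim statement. The function names and test values are hypothetical.

```python
import cmath


def double_sum(alpha, beta, u, n_terms=200):
    """Truncated  sum_{n,m >= 0} alpha^n * beta^m * u(n - m)."""
    return sum(alpha ** n * beta ** m * u(n - m)
               for n in range(n_terms) for m in range(n_terms))


def split_sum(alpha, beta, u, n_terms=200):
    """Same quantity after splitting indices into n = m, n > m and m > n."""
    tail = sum((alpha ** d + beta ** d) * u(d) for d in range(1, n_terms))
    return (u(0) + tail) / (1 - alpha * beta)


alpha = 0.7 * cmath.exp(0.4j)
beta = 0.6 * cmath.exp(-1.1j)
u = lambda d: 0.5 ** abs(d)    # symmetric, bounded sequence (here rho^|d| with rho = 0.5)

print(double_sum(alpha, beta, u))
print(split_sum(alpha, beta, u))
```

The diagonal terms $n=m$ contribute $u_{0}/(1-\alpha\beta)$ , and each off-diagonal set contributes a single geometric series in $\alpha$ or $\beta$ , which is what turns a double sum into a single one.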

A.2 The curse of memory: signal propagation analysis

We recall the assumptions that we stated in Section 2.2:

Linear diagonal recurrent neural networks. We restrict ourselves to networks satisfying $h_{t+1}=\lambda\odot h_{t}+x_{t+1}$ with $\lambda$ , $h_{t}$ and $x_{t}$ complex numbers. Without loss of generality, we focus on the one dimensional setting. We additionally consider $\lambda$ s with absolute values smaller than 1.

Infinite time horizon. We consider infinite sequences and initialize the network dynamics at $t_{0}=-\infty$ .

Without loss of generality, we can take $t=0$ given the wide-sense stationarity and infinite time horizon assumptions. Let us first remark that we have

We used Lemma 1 to obtain the last equality. In Section 2.2, we focused on the real case $\bar{\lambda}=\lambda$ , so this formula becomes Equation 5. If we further assume that the auto-correlation of $x$ decreases exponentially with decay rate $\rho$ , that is $R_{x}(\Delta)=\rho^{|\Delta|}$ , we can further simplify the last expression:

Differentiating the update $h_{t+1}=\lambda h_{t}+x_{t+1}$ with respect to $\lambda$ gives

Note that some extra technicalities are needed to justify these equations as $\lambda$ and $h_{t}$ are complex valued: these formulas hold as they would in the real-valued case as $h_{t}$ is a holomorphic function of $\lambda$ .
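Differentiating the update with the product rule yields a second recurrence, $\mathrm{d}_{\lambda}h_{t+1}=\lambda\,\mathrm{d}_{\lambda}h_{t}+h_{t}$ , that can be run alongside the first one. The sketch below, with hypothetical names and toy inputs, runs both for a complex $\lambda$ and compares the result with a central finite difference, which is legitimate here because $h_{t}$ is a polynomial (hence holomorphic) function of $\lambda$ for a finite sequence.

```python
import cmath


def run_with_sensitivity(lam, xs):
    """Return (h_T, dh_T/dlam) for h_{t+1} = lam * h_t + x_{t+1} started at h = 0."""
    h, dh = 0j, 0j
    for x in xs:
        dh = lam * dh + h      # product rule, using h_t before it is updated
        h = lam * h + x
    return h, dh


lam = 0.9 * cmath.exp(0.5j)
xs = [1.0, -0.5, 0.3, 2.0, -1.2]

h, dh = run_with_sensitivity(lam, xs)
eps = 1e-6
finite_diff = (run_with_sensitivity(lam + eps, xs)[0]
               - run_with_sensitivity(lam - eps, xs)[0]) / (2 * eps)
print(dh, finite_diff)
```

The sensitivity is itself the output of a linear recurrence driven by the hidden state, which is why it accumulates more strongly than the hidden state as $|\lambda|\rightarrow 1$ .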

We can now compute the variance of the sensitivity of the hidden state with respect to the parameters.

Differentiating this quantity as a product gives

Note that Equation 6 in the main text is the real-valued version of that formula.

Let us now further simplify this equation when $R_{x}(\Delta)=\rho^{|\Delta|}$ . Substituting this into the quantity before differentiating it, we get

Calculating this quantity manually is painful. Instead, we use the following trick. Its denominator is rather easy to compute: it is equal to $(1-\alpha\beta)^{3}(1-\rho\alpha)^{2}(1-\rho\beta)^{2}$ . We thus multiply the derivative of the function we want to compute by this denominator in order to obtain a polynomial with unknown coefficients, and use polynomial regression tools to recover these coefficients. Massaging the obtained expression to make it easier to compute the closed-form value of this quantity when $\rho=0$ and $\rho=1$ , we get

This is the quantity we plot in Figure 1.A, when $\lambda$ is real-valued. When $\rho=0$ , this quantity becomes

when $\rho=1$ . Additionally, it will diverge whenever $|\lambda|\rightarrow 1$ when $\rho<1$ , and when $\lambda\rightarrow 1$ when $\rho=1$ .

Regarding the backpropagation of errors to the inputs, the analysis we did in the main text also holds for complex numbers given that $h_{t}$ is a holomorphic function of $x_{t}$ and it thus behaves as the forward pass once replacing the input distribution with the one of output errors $\partial_{h_{t}}L_{t}$ .

We now turn to the non-diagonal case. For the sake of simplicity, we assume that the recurrent matrix is complex diagonalizable and that its eigenvalues are all different. This will enable us to differentiate the eigenvalues and the eigenvectors. We consider dynamics of the form

As $A$ is complex diagonalizable, there exists a complex-valued matrix $P$ and a complex-valued vector $\lambda$ such that

The linear recurrent neural network considered above is equivalent to its diagonal version

We now differentiate $h_{t}$ w.r.t. $A$ using the diagonal parametrization and obtain

Intuitively, the eigenvalues and eigenvectors move smoothly as we restricted ourselves to the case in which eigenvalues are distinct. If this is not the case, the math becomes trickier as the eigenvectors are not uniquely defined. We can study the behavior of those quantities in more detail, following Boeddeker et al. :

The $F$ introduced in the last equation is equal to

Importantly, those two quantities do not grow to infinity as the absolute value of the eigenvalues goes to 1, which means that we can consider those derivatives to be independent of $|\lambda|$ for the sake of our analysis. Note that the previous argument assumes that eigenvalues do not collapse.

It follows that the third term in the sum also corresponds to a low pass filtering of the inputs.

A.3 Impact of input normalization and parametrization

In this section, we consider a diagonal linear recurrent neural network of the form

with $\gamma(\lambda)$ the input normalization factor and $\lambda$ parametrized by a vector $\omega$ . Next, we study the effect of input normalization and reparametrization, first in the real-valued setting and then in the complex-valued one.

Appendix B Linear teacher-student task

This section details the theoretical results behind our analysis of the teacher-student task, presents all the details necessary to reproduce our empirical experiments, and provides additional analysis.

In this toy example, we are interested in learning a simple 1-dimensional linear recurrent neural network which follows the dynamics

to reproduce the hidden state $h_{t}^{*}$ of a teacher with recurrent parameter $\lambda^{*}$ . Note that we here allow all variables to be complex-valued. We take the loss to be

We have shown in Section A.2 that in the limit of $t\rightarrow\infty$ ,

Similar derivations hold for the other three terms in the loss. Grouping them gives the exact value of the loss. We omit the formula as it is not particularly insightful. In the case of constant inputs ( $\rho=1$ ), we have

In the case of i.i.d. inputs ( $\rho=0$ ), we have

This is the loss we plot on Figure 1.B and C.

Having a simple closed-form solution for the value the loss takes gives us the possibility to investigate in more detail what an optimal normalization and parametrization should be. We focus on the case $\rho=0$ .

For $\rho=0$ , the optimal normalization is $\gamma(\lambda)=\sqrt{1-|\lambda|^{2}}$ . Given that we now add an input normalization to the student, we must also add it to the teacher for the student to be able to fit it. The loss becomes

We now compute the Hessian of the loss w.r.t. $\omega_{\nu}$ . First, we can simplify our calculations by restricting ourselves to the case $\theta=\theta^{*}$ . The loss becomes

Differentiating this function a first time, we obtain

To keep that quantity constant, we thus have to solve the differential equation

which gives $\nu(\omega_{\nu})=\tanh(\omega_{\nu})$ .
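The effect of the $\tanh$ parametrization can be verified numerically. The sketch assumes $\rho=0$ , $\theta=\theta^{*}$ , and the normalization $\gamma(\lambda)=\sqrt{1-|\lambda|^{2}}$ applied to both student and teacher, under which expanding the loss gives $L(\nu)=1-\sqrt{(1-\nu^{2})(1-\nu^{*2})}/(1-\nu\nu^{*})$ . This expression is our own expansion from the definitions, and the function names are hypothetical.

```python
import math


def normalized_loss(nu, nu_star):
    """Loss for theta = theta*, rho = 0, with sqrt(1 - nu^2) input normalization."""
    overlap = math.sqrt((1 - nu ** 2) * (1 - nu_star ** 2)) / (1 - nu * nu_star)
    return 1.0 - overlap


def curvature(f, w, eps=1e-4):
    """Central finite-difference estimate of the second derivative of f at w."""
    return (f(w + eps) - 2 * f(w) + f(w - eps)) / eps ** 2


for nu_star in (0.5, 0.9, 0.99):
    direct = curvature(lambda nu: normalized_loss(nu, nu_star), nu_star)
    omega_star = math.atanh(nu_star)
    reparam = curvature(lambda w: normalized_loss(math.tanh(w), nu_star), omega_star)
    print(nu_star, direct, reparam)
```

Under these assumptions the curvature with respect to $\nu$ at the optimum is $1/(1-\nu^{2})^{2}$ , while the curvature with respect to $\omega_{\nu}$ stays close to 1 for every $\nu^{*}$ . This is what solving $\nu^{\prime}(\omega_{\nu})=1-\nu^{2}$ is meant to achieve.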

We now move to the parametrization of $\theta$ . We have

At optimality ( $\theta=\theta^{*}$ and $\nu=\nu^{*}$ ), we have $\alpha(0)=(1-\nu^{2})^{2}$ , $\alpha^{\prime}(0)=0$ and $\alpha^{\prime\prime}(0)=2\nu^{2}$ , so that

The optimal parametrization thus has to satisfy

First, the parametrization that we derived for the general case in Section A.3.2, which additionally ignored the dependence of $\gamma$ on $\lambda$ , is relatively accurate. The only difference is the appearance of the extra $\nu$ term, which becomes insignificant in the long memory limit $\nu\rightarrow 1$ .

Second, the optimal $\theta$ parametrization has to be a function of $\nu$ , and thus $\omega_{\nu}$ , so the differential equation $\nu$ needs to satisfy changes. Yet, this considerably complicates the calculation and there is no simple solution to that problem. One could still argue that the initial choice we made, that is to use a polar parametrization, is the issue. It could be, but most practical models end up using that choice, so highlighting the limitations of this choice has important practical consequences.
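The dependence of the angle sensitivity on $\nu$ can be seen numerically. The sketch below assumes $\rho=0$ , $\nu=\nu^{*}$ , and the $\sqrt{1-\nu^{2}}$ normalization, in which case expanding the loss as a function of the angle mismatch $\Delta\theta=\theta-\theta^{*}$ gives $1-\mathrm{Re}\left[(1-\nu^{2})/(1-\nu^{2}e^{i\Delta\theta})\right]$ . This expression and the reference value $\nu^{2}(1+\nu^{2})/(1-\nu^{2})^{2}$ used for comparison are our own calculations under these assumptions; function names are hypothetical.

```python
import cmath


def angle_loss(dtheta, nu):
    """Loss as a function of the angle mismatch when the magnitudes already match."""
    return 1.0 - ((1 - nu ** 2) / (1 - nu ** 2 * cmath.exp(1j * dtheta))).real


def angle_curvature(nu, eps=1e-4):
    """Finite-difference second derivative of the loss at dtheta = 0."""
    return (angle_loss(eps, nu) - 2 * angle_loss(0.0, nu) + angle_loss(-eps, nu)) / eps ** 2


for nu in (0.5, 0.9, 0.99):
    reference = nu ** 2 * (1 + nu ** 2) / (1 - nu ** 2) ** 2
    print(nu, angle_curvature(nu), reference)
```

The curvature along $\theta$ depends on $\nu$ and grows without bound as $\nu\rightarrow 1$ , so any rescaling of $\theta$ that keeps it constant must itself depend on $\nu$ .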

In the rest of this section, we ignore the dependency of $\theta$ on $\nu$ , and consider the optimal parametrization in this setting to be

We now visualize the effect of input normalization and reparametrization on the loss landscape. We focus on two such reparametrizations:

the optimal one we derived in the previous section (cf. Equations 84 and 85), which is tailored to this specific setting.

We use this one-dimensional teacher-student setting to test whether having a parametrization that avoids exploding behaviors at optimality, such as the one we derived in Section B.1.2, facilitates learning. Figure 6 already hints that either the basin of attraction of the global minima is extremely narrow or their number decreases as longer memories are considered, making learning more tedious. Figure 7 confirms it. In this figure, we plot the learning dynamics obtained using the Adam optimizer with a learning rate of $10^{-3}$ for $50$k steps, starting from $\lambda_{0}=0.99\exp(i\pi/4)$ . We consider three different parametrizations of the angle:

The first one does not reparametrize the angle, the second one is the one used in the LRU and the third one is the optimal one we derived above. We use $\nu=\tanh(\omega_{\nu})$ to parametrize the magnitude in the three cases. We set $\lambda^{*}$ to $\lambda^{*}=0.99\exp(i\pi/100)$ . The $\theta$ landscape when $\nu$ is correct therefore corresponds to the ones plotted in the last two columns of Figure 6. This example shows that efforts to reduce the sharpness of the loss at optimality, as done in the last parametrization, inevitably make the loss flatter elsewhere and optimization impossible.

B.2 Structure of the Hessian at optimality

In Section 4, we argue that the Hessian at optimality is an important object to understand the learning dynamics in the linear teacher-student task we consider. We here provide some theoretical analysis of its structure in the complex diagonal setting, that is we consider a recurrent network of the form

with $\lambda$ , $b$ and $c$ complex vectors of size $n$ , with $n$ the number of hidden neurons, and $d$ a scalar. We additionally take the loss to be the mean-square error, which is also the one we use in our numerical experiments. Note that, as in our theoretical analysis of Section 2, we consider infinitely long sequences and wide-sense stationary inputs.

Recall that the Hessian of the loss is equal to

At optimality, only the first term remains, as $\partial_{h_{t}}L_{t}$ is 0 for all data points. Given that we have shown earlier, e.g. in Section A.2, that the most sensitive parameters to learn are the recurrent ones $\lambda$ , we focus on the Hessian with respect to these parameters in the following.

Before delving into more specific calculations, we make a few remarks on how to deal with the Hessian when having complex-valued parameters. We will mostly leverage the fact that the loss $L$ is real-valued.

Before that, we recall a few facts about Wirtinger derivatives:

For $f(z)$ a complex-valued function of $z$ , the Wirtinger derivatives are defined as:

Leveraging the fact that $L$ is real-valued so that $\bar{L}=L$ , we have

Taken together, this shows that the full complex Hessian, which contains all cross derivatives, has a similar structure to the real case.
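The properties of Wirtinger derivatives used here can be checked with finite differences. The sketch relies on the standard definitions $\partial_{z}f=\frac{1}{2}(\partial_{x}f-i\partial_{y}f)$ and $\partial_{\bar{z}}f=\frac{1}{2}(\partial_{x}f+i\partial_{y}f)$ for $z=x+iy$ ; the helper name and the two test functions are illustrative choices.

```python
def wirtinger(f, z, eps=1e-6):
    """Finite-difference Wirtinger derivatives (df/dz, df/dzbar) of f at z."""
    dx = (f(z + eps) - f(z - eps)) / (2 * eps)              # derivative along the real axis
    dy = (f(z + 1j * eps) - f(z - 1j * eps)) / (2 * eps)    # derivative along the imaginary axis
    return 0.5 * (dx - 1j * dy), 0.5 * (dx + 1j * dy)


z0 = 0.3 + 0.4j

# Real-valued function: the two derivatives are complex conjugates of each other.
d_z, d_zbar = wirtinger(lambda z: abs(z) ** 2, z0)
print(d_z, d_zbar)

# Holomorphic function: the derivative with respect to zbar vanishes.
g_z, g_zbar = wirtinger(lambda z: z ** 2, z0)
print(g_z, g_zbar)
```

For the real-valued function $|z|^{2}$ the two derivatives are $\bar{z}$ and $z$ , illustrating the conjugate symmetry that a real-valued loss imposes on its complex derivatives.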

In this section, we compute the full complex Hessian with respect to the recurrent eigenvalue $\lambda$ and defer the analysis of reparametrization to the next section.

The previous calculation applied to one sequence; we now take the expectation over the data:

We can now remark that this quantity is very similar to the one we have encountered in Section A.2, up to the presence of $b_{i}b_{j}$ , and can be simplified using Lemma 2. For conciseness, we note $S(\lambda_{i},\lambda_{j})$ the right-hand side of the last equation without the $b_{i}b_{j}$ factor. Putting this result back in the Hessian, we get

To gain further intuition of the behavior of this quantity, we take $R_{x}(\Delta)=\rho^{|\Delta|}$ , $\rho$ being a real number. A similar calculation to the one we did in Section A.2 gives

This formula being still hard to grasp, we visualize the magnitude of $S(\lambda_{i},\lambda_{j})$ in Figure 8. Interestingly, we observe this quantity is large when $\lambda_{i}$ and $\lambda_{j}$ are conjugate to each other and inputs are uncorrelated. However, as elements in the input sequence get more correlated ( $\rho\rightarrow 1$ ), this effect disappears and $|S|$ increases as one of the two eigenvalues gets closer to 1 in the complex plane. In both cases, the effect gets amplified as the magnitude of the eigenvalue increases.

Using the symmetry with the complex Hessian matrix, we now have all its components.

So far, we have computed the complex Hessian, which is not of direct use as we end up optimizing real numbers in practice. Here, we study the impact of different parametrizations of $\lambda$ on the Hessian. Given that this parametrization only affects $\lambda$ and not the other parameters in the network and that we only consider the Hessian at optimality here, computing the Hessian of those parameters reduces to left and right multiplying the Hessian by derivatives of $\lambda$ and $\bar{\lambda}$ with respect to these parameters. For future reference, we introduce

with $A_{ij}:=b_{i}b_{j}c_{i}c_{j}S(\lambda_{i},\lambda_{j})$ and $B_{ij}=b_{i}\bar{b}_{j}c_{i}\bar{c}_{j}S(\lambda_{i},\bar{\lambda}_{j})$ .

Given the intuition we gained on the structure of $S$ previously, and the fact that $A_{ij}\propto S(\lambda_{i},\lambda_{j})$ and $B_{ij}\propto S(\lambda_{i},\bar{\lambda}_{j})$ , we know that this block will have large components if the two corresponding eigenvalues are conjugates of each other or aligned to each other, or if one of them is close to 1.

The calculations for this parametrization are similar to the previous one, with the following differences:

B.3 Experimental details

We recall the setup we consider in Section 4:

with $\nu$ and $\theta_{0}$ two scalars that we control. This transformation has several benefits: we are guaranteed that the magnitude of $\lambda$ is within $[\nu,1]$ (and in $[\nu,\nu+(1-\nu)\tanh(1)]$ in the limit $n\rightarrow\infty$ as the eigenvalues of $A$ stay within the unit circle in that limit), and conjugate pairs of eigenvalues remain conjugate. This last point ensures that the resulting matrix remains real without having to change the eigenvectors.
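One transformation that is consistent with the properties listed above maps each eigenvalue $\lambda$ to magnitude $\nu+(1-\nu)\tanh(|\lambda|)$ and angle $\theta_{0}\arg(\lambda)/\pi$ . The exact formula is an assumption on our part, inferred from the stated magnitude and angle ranges, and the function name is hypothetical; the sketch only checks that this candidate has the advertised properties.

```python
import cmath
import math


def reshape_eigenvalue(lam, nu, theta0):
    """Assumed transformation: squash |lam| into [nu, 1) and arg(lam) into [-theta0, theta0]."""
    magnitude = nu + (1.0 - nu) * math.tanh(abs(lam))
    angle = theta0 * cmath.phase(lam) / math.pi
    return cmath.rect(magnitude, angle)


nu, theta0 = 0.9, math.pi / 10
lam = 0.5 * cmath.exp(2.0j)

new = reshape_eigenvalue(lam, nu, theta0)
new_conj = reshape_eigenvalue(lam.conjugate(), nu, theta0)

print(abs(new), cmath.phase(new))
print(new.conjugate(), new_conj)   # conjugate pairs are mapped to conjugate pairs
```

Because the map only acts on the magnitude and rescales the angle symmetrically, conjugate pairs stay conjugate, which is what allows the transformed matrix to remain real with unchanged eigenvectors.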

We implement our experiments in JAX, using the default Flax implementation of RNNs and the LRU implementation of Zucchet et al. We initialize RNNs in the same way we initialized the teacher, and initialize the eigenvalues of the LRU and other complex-valued networks with magnitude in $[\nu,1]$ and angle within $[-\theta_{0},\theta_{0}]$ .

Given that we are interested in the optimization properties of the different architectures, we only report training losses and do not perform any cross validation.

Here are additional details related to the different figures:

Figure 4: for panels A and B, we use $\nu=0.99$ and draw $A$ in a slightly different manner to the one described above (we directly draw the eigenvalues and eigenvectors so that we have two pairs of complex eigenvalues). We use automatic differentiation to compute the Hessian. For panels C and D, we use the same setup as described in Table 2, but keep the learning rate constant over the course of learning. We report the effective learning rate at the end of learning.

Figure 10: for panels A, B and C, we draw the magnitudes and angles of 10 eigenvalues $\lambda$ independently and uniformly in $[\nu,\frac{1+\nu}{2}]$ and $[-\theta_{0},\theta_{0}]$ , respectively. Importantly, this means that there are no conjugate pairs, which leads to more diagonal Hessian matrices at optimality than in Figure 4. For panel D, see Table 3.
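The sampling scheme used for Figure 10 can be sketched as follows. This is a minimal illustration: the function name, the seed, and the values of $\nu$ and $\theta_{0}$ are placeholders, not the settings used in the experiments.

```python
import cmath
import random

def sample_eigenvalues(n, nu, theta0, seed=0):
    """Draw n eigenvalues with independent uniform magnitudes and angles."""
    rng = random.Random(seed)
    eigenvalues = []
    for _ in range(n):
        magnitude = rng.uniform(nu, (1.0 + nu) / 2.0)
        angle = rng.uniform(-theta0, theta0)
        eigenvalues.append(cmath.rect(magnitude, angle))
    return eigenvalues

nu, theta0 = 0.9, 0.5  # illustrative values
eigenvalues = sample_eigenvalues(10, nu, theta0)
```

Since each angle is drawn independently, exact conjugate pairs occur with probability zero.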

As a rule of thumb, each LRU (or complex-valued diagonal network) experiment takes 3 minutes on a consumer-scale GPU (NVIDIA GeForce RTX 3070) and each RNN experiment takes 10 minutes on a CPU. The scans behind the results reported in the different figures each require on the order of a few hundred runs. Including our preliminary exploration, the results we report in this section required 30 days of compute, one third of it on GPUs and two thirds on CPUs.

B.4 Additional analyses

In the main text, we only provide an analysis of the loss landscape for the fully connected linear recurrent neural network and its complex-valued diagonal counterpart. We here complete this result by performing the same analysis for the LRU.

The goal of this experiment is to better understand how the concentration of the eigenvalues $\lambda$ affects the learning dynamics. For fully connected RNNs, there is no reason to expect a major change in behavior. However, the situation is different for diagonal RNNs. The theoretical analysis of Section B.2 provides the following insights. When the elements of the input sequence are uncorrelated, as is the case here, the entries of the Hessian corresponding to two different eigenvalues increase if these eigenvalues are aligned or conjugate to each other, and if their magnitude is large. We therefore expect that, as the interval from which the angles of the teacher’s eigenvalues are drawn shrinks ( $\theta_{0}\rightarrow 0$ ), those eigenvalues will be more likely to be "similar" to each other. This results in large non-diagonal terms, as we confirm in Figure 10.A, B and C. The LRU suffers less from this problem thanks to its reparametrization, which reduces the overall magnitude of the Hessian entries related to the magnitude parameters and, in part, of those related to the angle parameters (when it is a small positive number). As a consequence, the performance gap between these two architectures increases as $\theta_{0}\rightarrow 0$ , as seen in Figure 10.D.

Appendix C Signal propagation in deep recurrent neural networks at initialization

We here detail the experimental setup we used in Section 5. We take 1024 random sequences from the Wikipedia dataset and pass them through the BERT tokenizer and embedding layer. This provides us with a dataset of 1024 examples, which we truncate to a length of 512. Each embedding has 724 features.

We consider networks with 4 blocks of the following structure: a recurrent layer, a non-linearity, a gated linear unit [57, GLU] and a skip connection. By default, we do not use any normalization layer, but when we do, as in Figure 5.C, we include one normalization layer before the recurrent layer and another one before the GLU. All the layers involved use 256 neurons. We additionally add a linear encoder at the beginning of the network, and a linear decoder at the end.
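The block structure can be sketched in a few lines of dependency-free Python. This is a toy illustration under several assumptions: the recurrent layer is a complex diagonal linear recurrence without input or output projections, the non-linearity is a $\tanh$, and the GLU uses two square weight matrices `W_value` and `W_gate`. Normalization layers, the encoder and the decoder are omitted, and all names are hypothetical.

```python
import cmath
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def diagonal_recurrence(xs, lams):
    """h_t = lam * h_{t-1} + x_t per channel; returns the real part of each state."""
    h = [0j] * len(lams)
    outputs = []
    for x in xs:
        h = [lam * h_i + x_i for lam, h_i, x_i in zip(lams, h, x)]
        outputs.append([h_i.real for h_i in h])
    return outputs

def block(xs, lams, W_value, W_gate):
    ys = diagonal_recurrence(xs, lams)                        # recurrent layer
    ys = [[math.tanh(y_i) for y_i in y] for y in ys]          # non-linearity (assumed tanh)
    out = []
    for x, y in zip(xs, ys):
        value, gate = matvec(W_value, y), matvec(W_gate, y)
        glu = [v * sigmoid(g) for v, g in zip(value, gate)]   # gated linear unit
        out.append([x_i + g_i for x_i, g_i in zip(x, glu)])   # skip connection
    return out

xs = [[0.5, -1.0], [1.0, 0.0], [-0.5, 2.0]]                   # 3 time steps, 2 features
lams = [cmath.rect(0.95, 0.1), cmath.rect(0.95, -0.1)]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = xs
for _ in range(4):                                            # stack of 4 blocks
    out = block(out, lams, identity, identity)
```

The skip connection means that a block whose GLU outputs zero acts as the identity on its input.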

In Figure 5, we vary $\nu$ , which controls the magnitude of the eigenvalues of the recurrent Jacobian. More precisely, we sample those magnitudes in the interval $[\nu,(1+\nu)/2]$ . For the complex-valued diagonal RNN and the LRU, we use the LRU initialization. For the LSTM, we use the chrono initialization of Tallec and Ollivier: it initializes the biases of the forget and input gates such that, when the input $x$ and the hidden state $h$ are equal to 0, the time constant associated with $f$ is uniformly sampled from $[\frac{1}{1-\nu},\frac{2}{1-\nu}]$ and the input gate $i$ is equal to $1-f$ .
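The chrono initialization described above can be made concrete as follows. The sketch assumes sigmoid gates and identifies the time constant of a forget gate value $f$ with $1/(1-f)$, so that a time constant $T$ corresponds to a forget bias of $\log(T-1)$. The function name and the seed are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chrono_biases(n, nu, seed=0):
    """Sample forget and input gate biases for n units (requires 0 < nu < 1)."""
    rng = random.Random(seed)
    t_min, t_max = 1.0 / (1.0 - nu), 2.0 / (1.0 - nu)
    forget_bias, input_bias = [], []
    for _ in range(n):
        T = rng.uniform(t_min, t_max)   # time constant of the unit
        b_f = math.log(T - 1.0)         # sigmoid(b_f) = 1 - 1/T
        forget_bias.append(b_f)
        input_bias.append(-b_f)         # sigmoid(-b_f) = 1 - sigmoid(b_f)
    return forget_bias, input_bias

nu = 0.9
forget_bias, input_bias = chrono_biases(256, nu)
```

With zero input and zero hidden state, the gate pre-activations reduce to the biases, so the input gate equals $1-f$ by the symmetry of the sigmoid.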

The loss that we use is a next-token mean-squared error, that is

with $\hat{x}_{t}(x_{1:t-1})$ the prediction of the network. The quantities reported in Figure 5 are the average squared value the hidden state or the gradient takes. The average is taken over all the sequences, but also over all neurons / parameters and over all time steps. Gradients are computed on batches of size 8.
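The loss and the reported statistics can be sketched as below. The factor $1/2$ and the averaging over time steps are assumptions about the normalization of the loss, and the helper names are hypothetical.

```python
def next_token_mse(predictions, targets):
    """Squared error between x_hat_t and x_t, averaged over time steps.

    The 0.5 factor and the time average are assumed normalizations.
    """
    total = 0.0
    for x_hat, x in zip(predictions, targets):
        total += 0.5 * sum((p - q) ** 2 for p, q in zip(x_hat, x))
    return total / len(targets)

def mean_squared_value(values):
    """Average squared magnitude over every axis of a nested list."""
    squares = []
    stack = [values]
    while stack:
        item = stack.pop()
        if isinstance(item, (list, tuple)):
            stack.extend(item)
        else:
            squares.append(abs(item) ** 2)
    return sum(squares) / len(squares)

sequence = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
targets = sequence[1:]                                   # x_t for t >= 2
predictions = [[v + 1.0 for v in x] for x in targets]    # off by one in every feature
example_loss = next_token_mse(predictions, targets)
example_stat = mean_squared_value([[1.0, 2.0], [3.0, 4.0]])
```

Here `mean_squared_value` flattens all axes before averaging, which mirrors taking the average over sequences, neurons or parameters, and time steps at once.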