Hierarchically Gated Recurrent Neural Network for Sequence Modeling

Zhen Qin, Songlin Yang, Yiran Zhong

Introduction

Sequence modeling is a fundamental problem in various domains such as natural language processing devlin-etal-2019-bert ; liu2019roberta ; qin2023scaling ; qin-etal-2023-linear ; liu2022neural , time series analysis zhou2021informer , computer vision vit ; vivit ; Sun2023Tpami ; lu2022linear , and audio processing gong21b_interspeech ; akbari2021vatt ; sun2022locality . Prior to the invention of Transformers vaswani2017attention , RNN and its variants were the primary selections of architectures for sequence modeling, and have been widely used in machine translation cho-etal-2014-learning , stock price prediction selvin2017stock , weather forecasting salman2015weather , speech recognition miao2015eesen , and etc.

RNNs have two main drawbacks: slow sequential training and limited capability in modeling long-term dependencies. With the swift development of deep learning and the pervasive use of GPUs, these drawbacks prevent it from flourishing in modern long-sequence modeling tasks. Meanwhile, Transformers vaswani2017attention have rapidly gained popularity and now dominate various research areas in sequence modeling due to their better abilities in parallel training and long-term dependency modeling. However, Transformer’s quadratic time complexity makes long sequence modeling expensive. On the other hand, RNN offers linear complexity and serves as an ideal choice for long sequence modeling. This work aims to address these RNN drawbacks, revitalizing their applicability in long-sequence modeling tasks.

To address the training inefficiency problem, we turn to more efficient RNN variants that employ element-wise linear recurrence (ELR) relations iclr18 . ELR provides two main advantages: (i) Removing nonlinearities in the recurrence enables parallelized training. (ii) By assuming independence between distinct hidden states, it enables efficient hidden state updates (through element-wise product instead of matrix multiplication) lei-etal-2018-simple ; s4d . Notably, ELR has been used in many modern linear RNN models, including the diagonalized versions of structured state-space models s4 (S4) dss ; s4d ; s5 and RWKV rwkv . In recent advancements, numerous studies have incorporated gating mechanisms into the outputs of linear recurrence layers gss ; h3 ; mega ; pretrainingwoattn ; rwkv , similar to the output gates in LSTMs and leading to considerable performance gains. However, most current studies overlook the significance of the forget gate, which is often regarded as the most important gate in LSTMs Greff2015LSTMAS ; forgetgate . In this work, we underscore the importance of employing forget gates in linear RNNs and adopt gated linear RNNs for both efficiency and high performance.

To effectively capture long-term dependencies in gated RNNs, it is crucial to maintain high forget gate values close to one gu-improving . However, gates in saturated regimes (i.e., close to zero or one) suffer from the gradient vanishing issue gu-improving . Moreover, if all forget gate values are close to one, RNNs will not be able to effectively forget irrelevant information, compromising their ability to model short-term dependencies. To address these challenges, we introduce Hierarchically Gated Recurrent Units(HGRU). In HGRU, we add an additive learnable value, referred to as the lower bound, to the original forget gate value, effectively mitigating the issue of saturated gates gu-improving by pushing gate activations away from the saturated regimes. Furthermore, we design the lower bounds to increase monotonically as we move up the layers of the RNN. This ensures that the forget gate values in the lower layers remain relatively small, enabling the necessary forgetting of past information for modeling short-term dependencies. In contrast, in the uppermost layer, the forget gate values approach one, facilitating the effective modeling of long-term dependencies. Our proposed model has proven to be highly efficient and effective, as demonstrated by its outstanding performance in language modeling, image classification, and long-range arena benchmarks.

Related work

metaformer abstracts self-attention (SA) as token mixing, thereby transforming the Transformer architecture into MetaFormer. MetaFormer comprises essential components such as token mixer, channel mixer, residual connections, and LayerNorm. This abstraction highlights that the success of Transformers does not solely rely on SA but rather on the holistic integration of these components. Notably, token mixers can be replaced with simpler alternatives like pooling layers without compromising the model’s performance in the context of vision transformer. For sequence modeling tasks, unified provides a comprehensive analysis and discussion of different token mixing strategies. Two prominent contenders, long convolution and linear recurrence, show promise as replacements for SA modules in long sequence modeling due to their superior asymptotic time complexity and competitive performances. In long convolution models Li2022WhatMC ; simplelongconv ; qin2023toeplitz ; hyena , the kernel size matches the input sequence length, enabling a broader context compared to traditional convolutions. Training is accomplished using the efficient $\mathcal{O}(n\log n)$ fast Fourier transforms (FFT) algorithm. However, long convolutions face challenges such as the need for causal convolution inference, which requires caching all historical computations similar to the key-value (KV) cache in SA. This can lead to memory limitations when processing long sequences. Moreover, the inference complexity of long convolutions remains higher than that of RNNs. These factors make linear RNNs a more suitable alternative to replace SA in long-sequence modeling. TransNormerLLM qin2023scaling scales efficient token mixing in large language models to achieve competitive performance and superior training and inference efficiency compared to transformer-based models.

Element-wise linear recurrence.

The slower training speeds of traditional RNNs can be attributed to two main factors: (i) The updating of the hidden state involves full matrix multiplication. (ii) The presence of nonlinearity within the recurrence prevents parallel computation. To tackle the first issue, lei-etal-2018-simple introduced a simplified interaction between hidden states. This allowed the hidden state update to be performed using an element-wise product instead of full matrix multiplication. They demonstrated that this approach is notably fast when the (nonlinear) recurrence for each dimension is fused within a single CUDA kernel. Likewise, for the linear case, diagonalized versions of S4 dss ; s4d have also exhibited speed improvements over S4 by leveraging element-wise recurrence. Regarding the second challenge, the ability to capture nonlinear dependencies on past data can be achieved by stacking multiple linear recurrence layers interleaved with nonlinear MLP blocks. This indicates the potential to eliminate nonlinearity, as suggested by strongtypedrnn ; iclr18 ; 2110.13985 . Empirical support for this strategy’s effectiveness came later, as demonstrated by DBLP:conf/nips/GuJGSDRR21 ; s4d ; s5 ; lru ; h3 ; rwkv . DBLP:journals/corr/abs-2307-11888 further highlighted that such an architecture still possesses Universal Approximator properties, thus justifying the employment of linear recurrence. By excluding nonlinearity, iclr18 ; s5 showed that the parallel scan algorithm can be used for parallel training.

Linear recurrence can be broadly categorized into exponential moving averages (EMA) and gating schemes, as noted by iclr18 . The key difference is whether the decay rate is data-dependent. Models such as S4 s4 , S4D s4d , MEGA mega , RWKV rwkv , and LRU lru utilize the EMA approach, where the decay rate is static for all time steps (i.e., data-independent), while our model uses a data-dependent dynamic decay rate through the use of the forget gate. We remark on the importance of incorporating a data-dependent decay rate, which is largely ignored by current works in linear RNNs. Although liquid S4 liquids4 uses a dynamic transition matrix (which amounts to a data-dependent decay rate), it employs a limited form for FFT-based training. Our model does not have the convolutional view and thus cannot use FFT for training but allows the use of parallel scans.

The field of linear Transformers and linear RNNs exhibits a close relationship. xfmrsarernns shows that linear Transformers can be reformulated as RNNs during auto-regressive decoding, revealing similarities to the update rules observed in fast weight additive outer products Schmidhuber1992LearningTC ; Schlag2021LinearTA . These updates can be seen as a special case of element-wise linear recurrence, where forget gate values are consistently set to one across time and hidden states are two-dimensional. However, this formulation in linear Transformers lacks the ability to forget irrelevant information, resulting in the attention dilution issue qin-etal-2022-devil . To address this limitation, Schlag2021LinearTA introduced the delta rule to forget values associated with the current write key by removing the corresponding value before adding the new value. Alternatively, rfa ; mao-2022-fine proposed gating mechanisms similar to those in gated RNNs to facilitate the forgetting of irrelevant information.

Long-term dependencies in RNNs.

RNNs fall short in long-term dependency modeling, which is commonly attributed to the gradient vanishing issue. Three methods are typically applied to mitigate this issue. (i) Gating mechanisms DBLP:journals/neco/GersSC00 ; Hochreiter2001GradientFI ; gru ; onlstm ; gu-improving , which are believed to be crucial to the success of LSTMs, use additive (instead of multiplicative) hidden state update rules to improve gradient flow. (ii) Regularizing or initializing the eigenvalues of the recurrent weight matrix (close) to one via identity matrices Le2015ASW or unitary matrices DBLP:conf/icml/ArjovskySB16 . In the diagonal linear RNN case, the eigenvalues coincide with the element-wise decay rates, and LRU lru uses randomized linear algebra techniques to initialize eigenvalues to be close to one. lru also interestingly points out that many modern state-space models use a very small time step value on initialization for discretization, resulting in eigenvalues or decay rates close to one. (iii) Adding skip connections between distant time steps to allow shortcuts for gradient flow DBLP:conf/icml/KoutnikGGS14 ; dilatedrnn ; DBLP:conf/iclr/ChungAB17 .Our approach combines (i) and (ii), which improves gating mechanisms with a regularized dynamic decay rate that approaches one in the upper layer.

Method

Our proposed Hierarchically Gated Recurrent Network (HGRN) is depicted in Figure 1. It has multiple stacked layers, each of which consists of a token mixing module HGRU and a channel mixing module GLU (Gated Linear Unit glu ).

2 HGRU exploration

We begin with a simple gated linear recurrent layer, which is defined as:

where $\odot$ denotes the element-wise product. Following the terminology used in the RNN literature, we refer to $\mathbf{f}_{t}$ and $\mathbf{i}_{t}$ as the forget and input gates, respectively. It is worth noting that $\mathbf{f}_{t}$ and $\mathbf{i}_{t}$ depend only on $\mathbf{x}_{t}$ and not on $\mathbf{h}_{t-1}$ . This characteristic enables the use of the parallel scan algorithm iclr18 ; s5 , otherwise it is infeasible. We then make the following changes toward our final HGRU step by step.

Lower bound on forget gate values.

We remark that there is a difference in the use of cummax between our model and ON-LSTM. In ON-LSTM, cummax is applied to the hidden state dimension within a single layer, while in our case, we apply cummax on the layer dimension across different layers to enable upper layers to model long-range dependencies.

Finally, $\lambda_{t}$ in the $k$ -th layer is parameterized as follows:

Compared to before (i.e., without lower bounds), to achieve the same forget rate value $\bar{\gamma}$ closed to one, $\mu_{t}$ will be pushed away from the Sigmoid activation function’s saturated regions (i.e., near one),

thereby mitigating the gradient vanishing issue gu-improving and making gradient-based optimization easier.

Tying input and forget gates.

To reduce the number of parameters, it is common to use leaky units, i.e., tying the input gate with the forget gate using $\mathbf{i}_{t}=1-\mathbf{f}_{t}$ , which has a close relationship to the discretization of continuous-time system DBLP:conf/iclr/TallecO18a and exponential moving average Hunter1986TheEW , and has been proven effective empirically gru ; Greff2015LSTMAS . To allows for a clear interpretation of encoding relative position information, we only apply this strategy on the magnitude argument:

Output gates and projection.

The addition of gates to the output of the recurrence layer has been shown to be effective in state-space models gss ; h3 ; mega ; pretrainingwoattn . Motivated by this, we incorporate an output gate before performing the output projection as follows and get HGRU:

3 Token mixing perspective of HGRU

We provide the token mixing perspective of HGRU similar to huang2023encoding . Expanding Equation 2, we have:

So the token mixing module can be formed as follows:

Note that the token mixing matrix $\mathbf{A}$ can be decomposed into two parts $\mathbf{A}=\mathbf{\Lambda}\odot\mathbf{\Theta}$ :

This decomposition means that the Token mixing matrix $\mathbf{\Lambda}$ can be decoupled into two independent modules, where $\mathbf{\Lambda}$ models the long-distance dependency and $\mathbf{\Theta}$ , a Toeplitz matrix, models the relative positional relationship and enhanced expressiveness. Note that if $\mathbf{\Theta}$ depends on the input, then the matrix $\mathbf{\Lambda}$ will no longer be a Toeplitz matrix, thus unable to capture relative position information. It can be also viewed as a RoPE-enhanced attention mechanism: $\mathbf{\Lambda}$ corresponds to the attention matrix but the attention score here is the cumulative product of data-dependent decay rates; $\mathbf{\Theta}$ directly corresponds to RoPE.

Experiments

We conduct a comparative analysis between our proposed HGRN and four widely adopted sequence modeling structures, i.e., attention-based, MLP-based, FFT-based, and state-space-based. We evaluate HGRN on the WikiText-103 dataset merity2017pointer and the Pile (pile, ) dataset for autoregressive language modeling, as well as the length extrapolation ability. To assess the accuracy and efficiency of our model in handling long-term dependencies, we utilize the LRA benchmark lra . Additionally, we showcase the robustness of HGRN in computer vision task on the ImageNet-1k dataset.

We implement our models in Pytorch paszke2019pytorch and train them on 8 Nvidia A100 GPUs. For HGRN, we found that fusing element-wise recurrence into a single CUDA kernel results in fast running speed in practice. iclr18 also found that unless the sequence length is sufficiently large, the parallel scan’s implementation is not necessarily faster than the sequential scan. As such, we use a CUDA-based sequential scan for implementation; however, our model has the potential to model very long sequences through the use of a parallel scan.

We adopt the same training configuration for all competitors, including batch size, learning rate, training epochs or iterations, etc. We list detailed hyper-parameters in the Appendix. For the autoregressive language modeling, we conducted three sets of experiments. Firstly, we validated the performance of two different-scale models on the Wikitext-103 dataset. We used the TNN configuration to verify the performance of the model at around 44m, and the Hyena configuration to verify the performance of the model at around 125m. To evaluate the performance of larger-scale models, we trained a 1b Transformer and HGRN on the Pile dataset using 10b tokens. To assess the performance in downstream tasks, we trained HGRN models of 150m, 350m, and 1b on the Pile dataset using 100b tokens and conducted zero-shot evaluations on downstream tasks.

For the LRA benchmark, We report results on all 6 tasks. For the image classification on the ImageNet-1k dataset, We integrate HGRN into the DeiT pmlr-v139-touvron21a structure, we replace the transformer layers with our HGRN modules. It is compared to the performance of the vanilla DeiT on the ImageNet-1K dataset for image classification.

2 Results

Autoregressive language modeling stands as a prominent task within the field of natural language processing, as it serves as a measure of a language model’s causal inference capability. This task requires the model to estimate the probability distribution of the subsequent token based on the previously seen tokens.

We show the performances of the autoregressive language modeling in table 1 and table 2. Compared to transformer-based methods, HGRN performs favourably than most efficient variants of the vanilla transformer such as FLASH hua2022transformer , 1+elu katharopoulos2020transformers , Performer choromanski2020rethinking and cosFormer zhen2022cosformer . Also, HGRN achieves better results than the MLP-based methods with a notable margin. Nevertheless, HGRN performs similarly to the original transformer vaswani2017attention . Finally, HGRN shares similar concepts with RNN-based such as S4 gu2022efficiently , DSS dss , GSS gss , RWKV rwkv , and LRU lru , our HGRN also achieves superior performance to all RNN-based methods. This provides evidence HRGN may be an effective method in LM We also report the extrapolation ability of HGRN compared to previous methods in Table 14.

We also trained a 1b model on the Pile dataset and compared it with LRU and Transformer. Specifically, our training parameters included a sequence length of 1024, batch size of 96, 100k updates, and a learning rate of 5e-4. It can be seen that HGRN still performs better at the 1b scale. Additionally, we trained 100b tokens of HGRN on the Pile dataset at 150m, 350m, and 1b sizes, and evaluated them against open-source Transformer-based models in downstream tasks. We selected Comparison on Commonsense Reasoning and Super GLUE tasks, and all evaluations were done using the lm-evaluation-harness eval-harness . HGRN achieves comparable performance to Transformer-based models when consuming only 1/3 of the tokens.

Long Range Arena

LRA tay2020long is proposed as a comprehensive evaluation for assessing the performances of models in processing long-term dependencies in various sequential modeling tasks. We show a performance comparison between HGRN and existing methods in Table 6. HGRN achieves comparable results with other SOTA methods.

Image Classification

The image classification results on the ImageNet-1K dataset are presented in Table 7. Notably, with comparable parameter sizes, our proposed HGRN model demonstrates superior performance compared to previous methods such as TNN and the vanilla transformer. It demonstrates the capability of HGRN in modeling visual modalities.

3 Ablation Study

We conducted ablation studies in the smallest-scaled setting (i.e., TNNqin2023toeplitz ’s setting on WikiText103 dataset) to thoroughly verify the effectiveness of each of our proposed components in HGRN. Experiments were conducted on the Pile dataset using a 1b model with 10b tokens for the forget gate experiment.

In table 8, we demonstrate the role of forget gate. From table 8, we observe that removing the forget gate significantly decreases the performance of HGRN, while adding a forget gate to LRU improves performance. On the other hand, using a data-independent forget gate (only lower bound) leads to lower performance compared to a data-dependent forget gate.

The influence of input gate and output gate

Table. 9 validates the effectiveness of using output gates and tying input and forget gates. w/o input gate means to remove the $1-\lambda_{t}$ term. w/o output gate means remove the left branch of HGRN in figure 1. Our design achieves the best performance.

The influence of lower bounds in forget gate values

We demonstrate the effectiveness of introducing a lower bound in Table 10 and Table 13. From Table 10, we observe that gating (i.e., without lower bound) is more critical than the lower bound (i.e., only lower bound). Combining gating and the lower bound consistently provides benefits, but the most significant improvement arises from the monotonically increasing lower bound. This aligns with the intuition that lower layers should primarily focus on nearby tokens, while upper layers should attend more broadly to capture long-term dependencies poli2023hyena .

Table 13 highlights the essential role of the lower bound in long sequence processing tasks. In these tasks, the model’s performance is notably poor and sometimes fails to converge without the lower bound. It is worth noting that language modeling tasks do not require extensive long-term dependencies, which explains why the model performs well even without the lower bound. However, in the task of LRA, the ability to capture long-term dependencies is crucial for achieving satisfactory performance.

The influence of complex-valued recurrence

Table 11 validates the utility of incorporating complex values in element-wise linear recurrence. Additionally, the experiments show that the phase argument $\theta$ should not be data-dependent.

4 Analysis on forget gate values

We present the distributions of forget gate values across layers for different methods in Table 12 and visualize the histogram of each layer in Figure 2, trained on the autoregressive language modeling task. The results demonstrate that the addition of lower bounds effectively increases the average forget gate values in higher layers (5-6). Notably, the medium forget gate values in the highest layer reach 0.98, enabling the modeling of long-term dependencies.

It is interesting to note that the average forget gate values of the LRU model consistently exceed those of our variant model without lower bounds, as per their eigenvalues. However, despite this, the language modeling performance of LRU is lower than that of our variant. Specifically, LRU scored 24.71, while our variant scored 31.12. This suggests that using data-dependent gates to selectively retain relevant information is advantageous, rather than relying on data-independent forget gate values across all time steps.

Conclusion

In this work, we have shown that gated linear RNNs could obtain impressive performance across different tasks and modalities without compromising efficiency. We highlighted the significance of the forget gate for linear RNNs in language modeling and emphasized the importance of an additive lower bound on forget gate values for modeling long-term dependencies.

Acknowledgement

This work is partially supported by the National Key R&D Program of China (NO.2022ZD0160100).

Limitations and broader impact

Our empirical evaluation of HGRN remains on a smaller scale compared to other large-scale models. Potentially negative social consequences include the misuse of brain models for unsuitable purposes or applications, which must be prohibited by appropriate rules. In the era of large language models, the inference cost is the key limitation of transformer-based models. RNNs provide a solution with lower inference costs. This could potentially lead to a significant evolution in the field.

References

Appendix

In this appendix, we examine the extrapolation ability of HGRN and provide the training and inference speed comparison of HGRN and existing efficient sequence modeling methods. We also illustrate the forget rates of each layer on a trained language model of HGRN. We also report the extrapolation ability of HGRN compared to previous methods in Table 14.

In this section, we tested HGRN ’s extrapolation ability by directly inferring the model with a variety of sequence lengths. As shown in Table 14, our method has the ability to train short and test long.

2 Speed comparison

In this section, we benchmark the speed of our method on the LRA benchmark. Our method achieves state-of-the-art training and inference speed.

3 Visualization

In this section, we visualize the forget rates of each layer on a model trained on language modeling tasks.

4 Configurations

We list detailed hyper-parameters of our experiments here.