RWKV: Reinventing RNNs for the Transformer Era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, Rui-Jie Zhu

Introduction

Deep learning has greatly advanced artificial intelligence, impacting a range of scientific and industrial uses. These often involve complex sequential data processing tasks such as natural language understanding, conversational AI, time-series analysis, and indirectly sequential formats like images and graphs Brown et al. (2020); Ismail Fawaz et al. (2019); Wu et al. (2020); Albalak et al. (2022). Predominant among these techniques are RNNs and Transformers Vaswani et al. (2017), each with specific benefits and drawbacks. RNNs require less memory, particularly for handling long sequences. However, they suffer from the vanishing gradient problem and cannot be parallelized along the time dimension during training, limiting their scalability Hochreiter (1998); Le and Zuidema (2016).

Transformers emerged as a powerful alternative, adept at managing local and long-range dependencies and supporting parallelized training Tay et al. (2022). Models such as GPT-3 Brown et al. (2020), ChatGPT OpenAI (2022); Kocoń et al. (2023), LLaMA Touvron et al. (2023), and Chinchilla Hoffmann et al. (2022) showcase the potential of Transformers in NLP. However, the self-attention mechanism’s quadratic complexity makes it computationally and memory intensive for tasks involving long sequences and constrained resources. This has stimulated research to enhance Transformers’ scalability, sometimes sacrificing some of their effectiveness Wang et al. (2020); Zaheer et al. (2020); Dao et al. (2022a).

To tackle these challenges, we introduce the Receptance Weighted Key Value (RWKV) model, combining the strengths of RNNs and Transformers while circumventing key drawbacks. RWKV alleviates the memory bottleneck and quadratic scaling associated with Transformers Katharopoulos et al. (2020) with efficient linear scaling, while maintaining the expressive properties of the Transformer, such as parallelized training and robust scalability. RWKV reformulates the attention mechanism with a variant of linear attention, replacing traditional dot-product token interaction with more effective channel-directed attention. This implementation, without approximation, offers the lowest computational and memory complexity; see Table 1.

The motivation behind RWKV is to balance computational efficiency with expressive capacity in neural networks. It offers a solution for handling large-scale models with billions of parameters, exhibiting competitive performance at a reduced computational cost. Experiments suggest RWKV addresses scaling and deployment challenges in AI, especially for sequential data processing, pointing towards more sustainable and efficient AI models.

Our contributions in this paper are as follows:

The introduction of RWKV, a novel architecture combining RNNs and Transformer advantages while mitigating their limitations.

Detailed experiments demonstrating RWKV’s performance and efficiency on benchmark datasets for large-scale models.

The release of pretrained models, from 169 million to 14 billion parameters, trained on the Pile Gao et al. (2020); Biderman et al. (2022), available at https://huggingface.co/RWKV.

Background

Here we briefly review the fundamentals of RNNs and Transformers.

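Popular RNN architectures such as LSTM Hochreiter and Schmidhuber (1997) and GRU Chung et al. (2014) are characterized by the following formulation (shown for LSTM; others can be reasoned about similarly):

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f), \tag{1}$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i), \tag{2}$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o), \tag{3}$$
$$\tilde{c}_t = \sigma_c(W_c x_t + U_c h_{t-1} + b_c), \tag{4}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{5}$$
$$h_t = o_t \odot \sigma_h(c_t). \tag{6}$$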

Although RNNs can be factored into two linear blocks ($W$ and $U$) and an RNN-specific block (1)–(6), as noted by Bradbury et al. (2017), the data dependency relying on previous time steps prohibits parallelizing these typical RNNs.

2.2 Transformers and AFT

Introduced by Vaswani et al. (2017), Transformers are a class of neural networks that have become the dominant architecture for several NLP tasks. Instead of operating on sequences step-by-step like RNNs, Transformers rely on attention mechanisms to capture relationships between all input and all output tokens:

$$\mathrm{Attn}(Q,K,V) = \mathrm{softmax}(QK^{\top})V, \tag{7}$$

where the multi-headness and scaling factor $\frac{1}{\sqrt{d_{k}}}$ are omitted for convenience. The core $QK^{\top}$ multiplication is an ensemble of pairwise attention scores between each token in a sequence, which can be decomposed as vector operations:

$$\mathrm{Attn}(Q,K,V)_t = \frac{\sum_{i=1}^{T} e^{q_t^{\top} k_i} \odot v_i}{\sum_{i=1}^{T} e^{q_t^{\top} k_i}}. \tag{8}$$

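AFT Zhai et al. (2021) alternatively formulates this as

$$\mathrm{Attn}^{+}(W,K,V)_t = \frac{\sum_{i=1}^{t} e^{w_{t,i}+k_i} \odot v_i}{\sum_{i=1}^{t} e^{w_{t,i}+k_i}}, \tag{9}$$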

where $\{w_{t,i}\}\in R^{T\times T}$ are the learned pairwise position biases, and each $w_{t,i}$ is a scalar.

Inspired by AFT, RWKV takes a similar approach. However, for simplicity, it modifies the interaction weights so that it can be transformed into an RNN. Each $w_{t,i}$ in RWKV is a channel-wise time decay vector multiplied by the relative position and traced backward from the current time as it decays:

$$w_{t,i} = -(t-i)w, \tag{10}$$

where $w\in(R_{\geq 0})^{d}$, with $d$ the number of channels. We require $w$ to be non-negative to ensure that $e^{w_{t,i}}\leq 1$ and the per-channel weights decay backwards in time.
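To make the decay concrete, the following is a minimal sketch in plain Python, assuming the weights take the form $w_{t,i} = -(t-i)w$ that the description above suggests. The function name `decay_weights` and the example rates are illustrative and not taken from the RWKV code base.

```python
import math

def decay_weights(w, t):
    """Return e^{w_{t,i}} for i = 1..t, assuming w_{t,i} = -(t - i) * w.

    `w` holds one non-negative decay rate per channel. The result has t rows
    (one per past position i) and one weight per channel in each row.
    """
    if any(c < 0 for c in w):
        raise ValueError("decay rates must be non-negative")
    return [[math.exp(-(t - i) * c) for c in w] for i in range(1, t + 1)]

# Three channels: no decay, slow decay, fast decay (illustrative values).
weights = decay_weights([0.0, 0.5, 2.0], t=4)
```

The last row (the current position) always has weight 1 in every channel, and earlier rows shrink faster in channels with a larger rate, which is the "decay backwards in time" behaviour that the non-negativity constraint guarantees.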

RWKV

The RWKV model architecture is defined by four fundamental elements that are intrinsic to the time-mixing and channel-mixing blocks:

$R$: The Receptance vector acts as the receiver of past information.

$W$: The Weight signifies the positional weight decay vector, a trainable parameter within the model.

$K$: The Key vector performs a role analogous to $K$ in traditional attention mechanisms.

$V$: The Value vector functions similarly to $V$ in conventional attention processes.

These core elements interact multiplicatively at each timestep, as depicted in Figure 2.

The RWKV model is composed of stacked residual blocks. Each block consists of a time-mixing and a channel-mixing sub-block, embodying recurrent structures to leverage past information.

This model uses a unique attention-like score update process, which includes a time-dependent softmax operation improving numerical stability and mitigating vanishing gradients (for rigorous proof, see Appendix H). It ensures that the gradient is propagated along the most relevant path. Additionally, layer normalization Ba et al. (2016) incorporated within the architecture aids in stabilizing the gradients, effectively addressing both vanishing and exploding gradient issues. These design elements not only enhance the training dynamics of deep neural networks but also facilitate the stacking of multiple layers, leading to superior performance over conventional RNN models by capturing complex patterns across different levels of abstraction (see also Appendix I).

In this architecture, all linear projection vectors ($R$, $K$, $V$ in time-mixing, and $R^{\prime}$, $K^{\prime}$ in channel-mixing) involved in computations are produced by linear interpolation between current and previous timestep inputs, facilitating a token shift.

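The vectors for time-mixing computation are linear projections of linear combinations of the current and previous inputs of the block:

$$r_t = W_r \cdot (\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}), \tag{11}$$
$$k_t = W_k \cdot (\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}), \tag{12}$$
$$v_t = W_v \cdot (\mu_v \odot x_t + (1-\mu_v) \odot x_{t-1}), \tag{13}$$

as are the channel-mixing inputs:

$$r^{\prime}_t = W^{\prime}_r \cdot (\mu^{\prime}_r \odot x_t + (1-\mu^{\prime}_r) \odot x_{t-1}), \tag{14}$$
$$k^{\prime}_t = W^{\prime}_k \cdot (\mu^{\prime}_k \odot x_t + (1-\mu^{\prime}_k) \odot x_{t-1}), \tag{15}$$

where each $\mu$ is a learned per-channel interpolation weight.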

The token shift is implemented as a simple offset in the temporal dimension at each block using the PyTorch (Paszke et al., 2019) library as nn.ZeroPad2d((0,0,1,-1)).
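The effect of this padding can be sketched without PyTorch. The example below is an illustration under stated assumptions: plain Python lists stand in for a single `(T, C)` sequence, and the names `token_shift`, `mix_with_previous`, and `mu` are hypothetical. Padding one row at the top of the time axis and cropping one at the bottom gives each position the input of the previous timestep, with zeros at the first position.

```python
def token_shift(x):
    """Shift a (T, C) sequence one step forward in time.

    Row t of the result is row t-1 of the input. Row 0 is all zeros and the
    last input row is dropped, so the sequence length is unchanged.
    """
    if not x:
        return []
    channels = len(x[0])
    return [[0.0] * channels] + [list(row) for row in x[:-1]]

def mix_with_previous(x, mu):
    """Interpolate each timestep with its predecessor, channel by channel.

    mu[c] = 1 keeps only the current input in channel c, mu[c] = 0 keeps
    only the previous one.
    """
    shifted = token_shift(x)
    return [
        [m * cur + (1.0 - m) * prev for m, cur, prev in zip(mu, row, prev_row)]
        for row, prev_row in zip(x, shifted)
    ]

x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mixed = mix_with_previous(x, mu=[1.0, 0.5])
# mixed == [[1.0, 5.0], [2.0, 15.0], [3.0, 25.0]]
```

In the model, the interpolated vectors would then be passed through the linear projections; the sketch stops before that step.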

3.1.2 WKV Operator

The computation of the $WKV$ operator in our model parallels the method used in Attention Free Transformer (AFT) Zhai et al. (2021). However, unlike AFT where $W$ is a pairwise matrix, our model treats $W$ as a channel-wise vector that is modified by relative position. In our model, this recurrent behavior is defined by the time-dependent update of the $WKV$ vectors, formalized in the following equation:

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w+k_i} \odot v_i + e^{u+k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w+k_i} + e^{u+k_t}}. \tag{16}$$

To circumvent any potential degradation of $W$, we introduce a vector $U$ that separately attends to the current token. More information about this can be found in Appendix I.
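As a readable reference for the operator, the sketch below computes $wkv_t$ by direct summation over earlier tokens. It assumes the weighting described in this section and in Appendix D: a past token $i$ contributes with weight $e^{-(t-1-i)w + k_i}$ and the current token with weight $e^{u + k_t}$. The function name `wkv_direct` is illustrative, and this quadratic-time loop is not how the released implementation computes the operator.

```python
import math

def wkv_direct(w, u, k, v):
    """Compute wkv_t for every step by summing over all earlier tokens.

    w, u: per-channel decay and current-token bonus (lists of length C).
    k, v: keys and values, each a list of T rows with C entries.
    Cost is O(T^2 * C); this is a reference for checking, not a fast path.
    """
    T, C = len(k), len(w)
    out = []
    for t in range(T):
        row = []
        for c in range(C):
            # The current token is weighted with the bonus u, not the decay.
            cur = math.exp(u[c] + k[t][c])
            num, den = cur * v[t][c], cur
            for i in range(t):
                # The immediately preceding token is undecayed; each step
                # further back multiplies the weight by e^{-w}.
                wt = math.exp(-(t - 1 - i) * w[c] + k[i][c])
                num += wt * v[i][c]
                den += wt
            row.append(num / den)
        out.append(row)
    return out

# One channel, two tokens with equal keys: the second output averages both values.
out = wkv_direct(w=[0.5], u=[0.0], k=[[0.0], [0.0]], v=[[1.0], [3.0]])
```

Because numerator and denominator share the same positive weights, each output is a weighted average of the values seen so far, so it always lies between the smallest and largest of those values in each channel.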

3.1.3 Output Gating

Output gating is implemented in both time-mixing and channel-mixing blocks using the sigmoid of the receptance, $\sigma(r)$. The output vector $o_{t}$ after the $WKV$ operator is given by:

$$o_t = W_o \cdot (\sigma(r_t) \odot wkv_t). \tag{17}$$

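In the channel-mixing block, a similar operation is performed:

$$o^{\prime}_t = \sigma(r^{\prime}_t) \odot (W^{\prime}_v \cdot \max(k^{\prime}_t, 0)^{2}), \tag{18}$$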

where we adopt the squared ReLU activation function So et al. (2021).
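The channel-mixing gate can be sketched in a few lines of plain Python. This is an illustration that assumes the output has the form $\sigma(r^{\prime}) \odot (W^{\prime}_v \cdot \max(k^{\prime}, 0)^{2})$; the helper names and the small matrix `W_v` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_relu(z):
    return max(z, 0.0) ** 2

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in W]

def channel_mix_output(r, k, W_v):
    """Gate the value projection of squared-ReLU keys by sigmoid(r)."""
    hidden = [squared_relu(z) for z in k]
    projected = matvec(W_v, hidden)
    return [sigmoid(g) * p for g, p in zip(r, projected)]

# Hidden size 3 projected back to 2 channels (illustrative shapes).
gated = channel_mix_output(
    r=[0.0, 0.0],
    k=[2.0, -1.0, 1.0],
    W_v=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
)
# gated == [2.0, 0.5]
```

A receptance of zero passes exactly half of the projected value through; large negative receptance closes the gate and large positive receptance opens it fully.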

3.2 Transformer-like Training

RWKV can be efficiently parallelized using a technique called time-parallel mode, reminiscent of Transformers. The time complexity of processing a batch of sequences in a single layer is $O(BTd^{2})$, primarily consisting of matrix multiplications $W_{\lambda}$, where $\lambda\in\{r,k,v,o\}$ (assuming $B$ sequences, $T$ maximum tokens, and $d$ channels). In contrast, updating attention scores $wkv_{t}$ involves a serial scan (see Appendix D for more detail) and has complexity $O(BTd)$.

The matrix multiplications can be parallelized similarly to $W_{\lambda}$, where $\lambda\in\{Q,K,V,O\}$ in conventional Transformers. The element-wise $WKV$ computation is time-dependent but can be readily parallelized along the other two dimensions Lei et al. (2018). For extremely long sequences, more sophisticated methods such as Martin and Cundy (2017) that parallelize over sequence length could be used.

3.3 RNN-like Inference

Recurrent networks commonly utilize the output at state $t$ as input at state $t+1$. This usage is also observed in the autoregressive decoding inference of language models, where each token must be computed before being passed to the next step. RWKV takes advantage of this RNN-like structure, known as time-sequential mode. In this context, RWKV can be conveniently formulated recursively for decoding during inference, as demonstrated in Appendix D.

3.4 Additional Optimizations

To address inefficiencies in the $WKV$ computation arising from the sequential nature of the task when using standard deep learning frameworks, we have developed a custom CUDA kernel. This kernel enables the execution of a single compute kernel on training accelerators, while all other parts of the model, such as matrix multiplications and point-wise operations, are already inherently parallelizable and efficient.

During the initial stage of training a transformer model Vaswani et al. (2017), we observe that the embedding matrix undergoes slow changes, presenting a challenge for the model to move away from its initial noisy embedding state. To address this issue, we propose an approach that involves initializing the embedding matrix with small values and subsequently applying an additional LayerNorm operation. This accelerates and stabilizes the training process, allowing for the training of deep architectures with post-LN components. The effectiveness of this approach is demonstrated in Figure 9, illustrating improved convergence by enabling the model to quickly transition away from the initially small embedding. This is achieved through small changes occurring in a single step, which subsequently lead to substantial alterations in directions and further notable changes after the LayerNorm operation.

Building on principles from previous works (He et al., 2016; Jumper et al., 2021), we adopt an initialization strategy where parameters are set to values resembling an identity mapping while breaking symmetry to establish a clear information flow. The majority of weights are initialized to zero, and linear layers do not employ biases. Detailed formulas are given in Appendix E. We observe that the choice of initialization plays a crucial role in both the speed and quality of convergence (refer to Appendix F for further details).

3.5 Implementation

RWKV is implemented using the PyTorch Deep Learning Library Paszke et al. (2019). We integrate additional optimization strategies inspired by DeepSpeed Rasley et al. (2020) into the system, improving its efficiency and scalability.

The model begins with an embedding layer, as detailed in Section 3.4. Following this are several identical residual blocks arranged sequentially. These are depicted in Figures 2 and 3 and adhere to the principles outlined in Section 3.1.1. After the last block, a simple output projection head, consisting of a LayerNorm Ba et al. (2016) and a linear projection, is employed for logits generation for next-token prediction and computation of the cross-entropy loss during training.

Trained Models and Computing Costs

To demonstrate the scalability of RWKV, we train six models ranging from 169 million to 14 billion parameters as shown in Table 2. All models are trained for one epoch (330 billion tokens) on the Pile (Gao et al., 2020; Biderman et al., 2022).

The number of parameters for each model is computed using the formula $\text{\# parameters}=2VD+13D^{2}L+D(11L+4)$, where $V=50277$ is the vocabulary size, $D$ represents the model dimension, and $L$ corresponds to the number of layers. FLOPs are reported for a forward pass for one token, calculated as $2(2VD+13D^{2}L)$, which is twice (add and multiply) the number of parameters in linear layers. The backward pass FLOPs can be approximated as twice that of the forward pass, giving a total of $6(2VD+13D^{2}L)$ FLOPs per token. Notably, this matches the standard formula for FLOP calculations in transformers Kaplan et al. (2020): $\text{FLOP}=6\cdot[\text{\# tokens}]\cdot[\text{\# parameters}]$.
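The formulas above are easy to evaluate directly. The sketch below does so in Python; the function names are illustrative, and the configuration in the example (12 layers, model dimension 768) is an assumed one chosen for illustration rather than a value quoted from this section.

```python
VOCAB_SIZE = 50277  # vocabulary size V stated in the text

def linear_parameter_count(d_model, n_layers, vocab=VOCAB_SIZE):
    """Parameters in linear layers: 2VD + 13 D^2 L."""
    return 2 * vocab * d_model + 13 * d_model**2 * n_layers

def parameter_count(d_model, n_layers, vocab=VOCAB_SIZE):
    """Total parameters: 2VD + 13 D^2 L + D(11L + 4)."""
    return linear_parameter_count(d_model, n_layers, vocab) + d_model * (11 * n_layers + 4)

def flops_per_token(d_model, n_layers, vocab=VOCAB_SIZE):
    """Return (forward, forward + backward) FLOPs for one token."""
    linear = linear_parameter_count(d_model, n_layers, vocab)
    return 2 * linear, 6 * linear

# Assumed configuration for illustration: D = 768, L = 12.
params = parameter_count(768, 12)
forward, total = flops_per_token(768, 12)
# params == 169_342_464, forward == 338_476_032, total == 1_015_428_096
```

With these assumed values the count comes to roughly 169 million parameters, which is the size of the smallest model mentioned in this section.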

For training, we use the standard Adam optimizer without weight decay, use bfloat16 precision, and train with a context length of 1024 tokens. Further details on hyperparameters are in Appendix G. Departing from standard practice for transformers, we apply exponential decay to our learning rate: it remains constant for the initial iterations and subsequently decays exponentially. We also incorporate the auxiliary loss introduced by PaLM Chowdhery et al. (2022), supplementing the standard cross-entropy loss function. This auxiliary loss encourages the logarithm of the softmax normalizer to stay close to zero.
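A constant-then-exponential schedule of this kind can be sketched as follows. The function name, its parameters, and all numeric values are hypothetical; the actual hyperparameters are the ones given in Appendix G, not these.

```python
def learning_rate(step, lr_init, lr_final, hold_steps, total_steps):
    """Hold lr_init for hold_steps, then decay exponentially to lr_final.

    After the hold phase the rate is multiplied by a constant factor per
    step, so it moves along a straight line on a log scale and reaches
    lr_final at total_steps.
    """
    if step <= hold_steps:
        return lr_init
    progress = min((step - hold_steps) / (total_steps - hold_steps), 1.0)
    return lr_init * (lr_final / lr_init) ** progress

# Illustrative values only: hold for 100 steps, decay from 1e-3 to 1e-5 by step 1100.
schedule = [learning_rate(s, 1e-3, 1e-5, 100, 1100) for s in (0, 100, 600, 1100)]
```

Halfway through the decay phase the rate sits at the geometric mean of the initial and final values, which distinguishes this schedule from linear or cosine decay.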

4.2 Scaling Laws

Scaling laws Kaplan et al. (2020); Henighan et al. (2020); Hoffmann et al. (2022); Muennighoff et al. (2023) in language models refer to the mathematical relationships that describe how the performance of a language model changes with respect to various factors. These factors can include the model size ($N$), dataset size ($D$), or the optimally allocated compute budget ($C_{\rm min}$). Scaling laws are important for two primary reasons: they allow us to make predictions and plans regarding the costs and performance of large models before they are trained via interpolation and extrapolation Black et al. (2022); Le Scao et al. (2022), and the contexts in which they fail provide rich feedback on important areas for future research Wei et al. (2022a); Biderman et al. (2023a).

Previous work on scaling laws for RNNs has claimed that LSTMs do not strictly follow the same log-log linear scaling that transformers do Kaplan et al. (2020). We train 45 RWKV models for a variety of pairs (dataset, parameters) and find that RWKV does follow the same general form of the scaling law that is well established for transformers. Figure 4 shows our results for loss as a function of compute, with the linear fit to the Pareto optimal points holding an $r^{2}$ value of $0.994$. Even when we extrapolate our curve an additional order of magnitude (blue), we find an extremely good fit with an $r^{2}$ of $0.875$.

Evaluations

Having demonstrated the scalability of RWKV models in the previous section, we now turn our attention to their competitiveness with traditional transformers. We focus on two questions:

Is RWKV competitive against quadratic transformer architectures with the same amount of compute?

Does increasing the context length of RWKV yield better language modeling loss when RWKV models are trained for context lengths that most open-sourced quadratic transformers cannot efficiently process?

5.1 NLP Evaluations

To demonstrate that RWKV is competitive with traditional transformers at NLP tasks, we compare with similarly sized models trained for a similar number of tokens (Pythia Biderman et al. (2023b), OPT Zhang et al. (2022) and BLOOM Scao et al. (2022)). All RWKV models were trained for one epoch on the Pile (330B tokens), which is close but not identical to the number of tokens the Pythia, OPT, and BLOOM models were trained for. Consequently, we compare our models on a FLOP-matched basis. We avoid comparing with models trained in the Chinchilla-optimal regime (Hoffmann et al., 2022) or the overtrained regime (Touvron et al., 2023) to ensure the most equitable comparison.

We report results on ARC (both Easy and Challenge) Clark et al. (2018), BoolQ Clark et al. (2019), COPA Roemmele et al. (2018), HeadQA Vilares and Gómez-Rodríguez (2019), HellaSwag Zellers et al. (2019), LAMBADA Paperno et al. (2016), OpenBookQA Mihaylov et al. (2018), PIQA Bisk et al. (2020), ReCoRD Zhang et al. (2018), SciQ Welbl et al. (2017), and Winogrande Zellers et al. (2020). Figure 1 shows the average results across all benchmarks. Some individual benchmarks are shown in Fig. 5, with the rest in Appendix J.

Additionally, we carried out comparative studies on RWKV and ChatGPT / GPT-4 (see Appendix L). They revealed that RWKV is very sensitive to prompt engineering. When the prompts were adjusted (re-ordered) from the ones used for GPT to ones more suitable for RWKV, the performance (F1) increased from 44.2% to 74.8%. For sarcasm detection, RWKV outperformed ChatGPT, but was still slightly worse than the SOTA solution.

5.2 Extended Context Finetuning

Unlike transformers, RNNs do not have a pre-defined sequence length when they are created. However, in order to make efficient use of compute, we nevertheless need to preprocess the training data into contexts of the same length. We find that we are able to teach the model how to efficiently handle substantially larger batch sizes by finetuning with progressively increasing sequence length. Specifically, we first double the sequence length from 1024 to 2048 and finetune for 10B tokens from the original pretraining corpus, then we double again to 4096 for 100B tokens from the same corpus, and finally double to 8192 tokens for another 100B tokens from the same corpus. In Fig. 6 we show that increasing context length leads to lower test loss on the Pile, an indication that RWKV can make effective use of long contextual information.

5.3 Long Context Benchmarks

Additionally, we evaluate our model’s ability to handle very long sequences by comparing to state-of-the-art long sequence models on the Long-Range Arena (LRA) benchmark Tay et al. (2021). LRA is designed to assess the performance of models in handling lengthy context situations. It includes a collection of tasks with sequences ranging from 1,000 to 16,000 tokens, covering various types of data like text, natural language, synthetic images, and mathematical expressions. We apply RWKV on the LRA benchmark and the results are in Appendix J.2. The results show that RWKV performs second only to the S4 model in five datasets.

Inference Experiments

We benchmark inference requirements according to size and family. Specifically, we evaluate text generation speed and memory requirements on typical compute platforms including CPU (x86) and GPU (NVIDIA A100 80 GB). For all of our inference experiments we use float32 precision and the HuggingFace Transformers library Wolf et al. (2020). We include all model parameters in the parameter count, including both embedding and non-embedding layers. Performance under different quantization setups is left to further work. See Appendix K for more results.

Future Work

There are several promising directions for future work on the RWKV architecture. Work can be done to increase model expressivity by enhancing the time-decay formulations and exploring initial model states while maintaining efficiency.

The computational efficiency of RWKV can be further improved by applying a parallel scan in the $wkv_{t}$ step to reduce the computational cost to $O(B\log(T)d)$.
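To illustrate why a scan applies here, note that a recurrence of the form $a_t = \lambda a_{t-1} + x_t$ is a composition of affine maps, and composition is associative. The sketch below is a minimal single-channel illustration using recursive doubling, one common scan variant and not necessarily the one the authors have in mind. It does more total work than a sequential pass but needs only about $\log_2 T$ dependent rounds, and it leaves out the exponent rescaling needed for numerical safety. All names are hypothetical.

```python
def combine(first, second):
    """Compose two affine maps a -> m*a + c, applying `first` then `second`."""
    m1, c1 = first
    m2, c2 = second
    return (m2 * m1, m2 * c1 + c2)

def scan_linear_recurrence(decay, inputs):
    """Inclusive scan for state = decay * state + x, starting from zero.

    Uses recursive doubling: about log2(T) rounds, and within a round every
    position is independent, so each round could run in parallel.
    """
    elems = [(decay, x) for x in inputs]
    offset = 1
    while offset < len(elems):
        # The comprehension reads the previous round's list throughout.
        elems = [
            combine(elems[i - offset], e) if i >= offset else e
            for i, e in enumerate(elems)
        ]
        offset *= 2
    # Applying each composed map to the zero initial state leaves its offset c.
    return [c for _, c in elems]

def sequential_recurrence(decay, inputs):
    """Plain left-to-right evaluation, used here as the reference."""
    state, out = 0.0, []
    for x in inputs:
        state = decay * state + x
        out.append(state)
    return out

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
scanned = scan_linear_recurrence(0.5, xs)
reference = sequential_recurrence(0.5, xs)
```

Both functions return the same sequence; only the dependency structure differs, which is what matters on parallel hardware.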

The mechanisms used in RWKV can be applied to encoder-decoder architectures, potentially replacing the cross-attention mechanism. This could be applicable in seq2seq or multimodal settings, thereby enhancing efficiency during both training and inference.

RWKV’s state (or context) can be leveraged for interpretability, predictability in sequence data, and safety. Manipulating the hidden state could also guide behavior and allow greater customizability through prompt tuning.

The RWKV architecture is not perfect and can be improved in many respects, such as modifying the formulae or implementing larger internal states. Larger states can enhance the model’s memory of previous context and improve performance over various tasks.

Conclusions

We introduced RWKV, a new approach to RNN models exploiting the potential of time-based mixing components. RWKV introduces several key strategies that allow it to capture locality and long-range dependencies while addressing limitations of current architectures by: (1) replacing the quadratic QK attention with a scalar formulation at linear cost, (2) reformulating recurrence and sequential inductive biases to enable efficient training parallelization and efficient inference, and (3) enhancing training dynamics using custom initializations.

We benchmark the proposed architecture in a wide variety of NLP tasks and show comparable performance to SoTA with reduced cost. Further experiments on expressivity, interpretability, and scaling showcase the model capabilities and draw parallels in behavior between RWKV and other LLMs.

RWKV opens a new route for scalable and efficient architectures to model complex relationships in sequential data. While many alternatives to Transformers have been proposed with similar claims, ours is the first to back up those claims with pretrained models with tens of billions of parameters.

Limitations

While our proposed RWKV model has demonstrated promising results regarding training and memory efficiency during inference, some limitations should be acknowledged and addressed in future work.

First, the linear attention of RWKV leads to significant efficiency gains, but it may also limit the model’s performance on tasks that require recalling fine-grained details over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.

Another limitation of this work is the increased importance of prompt engineering in comparison to standard Transformer models. The linear attention mechanism used in RWKV limits the information from the prompt that will be carried over to the model’s continuation. As a result, carefully designed prompts may be even more crucial for the model to perform well on tasks.

The above RWKV property was confirmed by studies on prompt engineering presented in Appendix L. By changing the order of the information pieces, we were even able to almost double the RWKV performance for some tasks.

Ethics Statement

In this paper, we present a novel architecture for sequential data processing and prove its effectiveness by building a series of LLMs trained on publicly released pretraining data Gao et al. (2020); Biderman et al. (2022) and later fine-tuned on publicly available instructions Taori et al. (2023); Chaudhary (2023); Cheung (2023); Anand et al. (2023); Anonymous (2023); Yang (2023); Ji et al. (2023a, b).

As a novel architecture for sequential data, RWKV has the potential to improve sequence-based models across different applications ranging from natural language processing to biomedical data processing or climate modelling. Since the training code is released open source, RWKV contributes to the democratization of AI, levels the playing field, and empowers members of the Open Source community to inspect, study, and finetune RWKV in particular tasks. Moreover, it contributes to advancing the understanding of LLMs capabilities and limitations. A significant amount of work has been devoted to increasing the efficiency of RWKV training so as to minimize its cost and promote accessibility.

For LLMs trained on public data, RWKV’s lower inference cost compared to Transformer alternatives makes it more suitable for deployment on consumer and edge hardware, which is a step towards the democratization and distribution of LLMs to the general public, creating better privacy and ownership incentives. It also lowers the resource barrier to chat assistants and text generation for small and/or underrepresented communities. Pretrained model weights for different sizes ranging from 0.1B to 14B parameters trained on multiple languages are released to increase ease of adoption and allow for the study of emergent phenomena.

On the other hand, with lower resource barriers, the spreading of AI-generated text might become more prevalent. Current RWKV LLMs may exhibit and/or reproduce biases and potentially harmful content present in the data used for training. Nonetheless, mitigation and finetuning strategies discussed for other large Transformer models should be applicable to RWKV as well.

Acknowledgements

We thank StabilityAI for the compute used to train our models and for technical support in development of RWKV. We also thank the members of the RWKV and EleutherAI Discord servers for their help and work on further extending the applicability of RWKV to different domains.

References

Appendix A Author Contributions

All authors contributed to the drafting of this paper. Eric Alcaide and Quentin Anthony organized the paper and its experiments and were involved in all phases of the development process.

Bo Peng (lead), Matteo Grella, Xuzheng He, Haowen Hou, Jiaming Kong, Johan S. Wind

Stella Biderman (lead), Kranthi Kiran GV, Krishna Sri Ipsit Mantri, Atsushi Saito, Qihang Zhao, Peng Zhou, Rui-Jie Zhu

Xingjian Du, Rui-Jie Zhu, Bolun Wang, Ruichong Zhang, Jian Zhu

Samuel Arcadinho, Przemysław Kazienko, Qinghua Zhou

Huanqi Cao, Michael Chung, Matteo Grella, Ferdinand Mom, Zhenyuan Zhang

Jan Kocoń (lead), Przemysław Kazienko, Bartłomiej Koptyra, Hayden Lau, Xiangru Tang, Stanisław Woźniak, Zhenyuan Zhang

Appendix B Author Contributions

Original RWKV idea, original code, performance optimizations, original experiments, and trained RWKV models from 0.1B to 14B.

Manuscript (initial draft sections 1, C; sections 3, 7 and 8; revision and proofreading; final version). Figures (2, 3, 3, 8). Experiments section 6. Appendices E, K. Contributions to Appendix M.

Manuscript (organization, initial draft sections 1, C, 2; revision and proofreading; final version).

Manuscript (abstract and sections 1, 9, and 7; proofreading and revision).

Contributions to Figures 7, 13, and 14. Contributions to Appendix K.

Performed the scaling laws analysis and evaluated competitor models on benchmark tasks.

Manuscript (contributions to 3.2 and 3.3; proofreading and revision). Experiments for Appendix I.

Manuscript (proofreading and revision). Contributions to Appendix M, J.

Manuscript (contributions to section I; proofreading and revision).

Evaluation on Long Range Arena Benchmark (TBD until 5.31).

Manuscript (sections H, I, 8; contributions to sections 1, 7 and 9; proofreading and revision). Contributions to Appendix D.

Manuscript (sections C and 5; contributions to section 2; revision and proofreading). Tables K and K. Appendix 4.

Manuscript (contributions to section 2; proofreading and revision). Contributions to Figure 8. Appendix I. Contributions to Appendix H.

Manuscript (proofreading and revision). Contributions to Section 6, 9, and Appendix L.

Manuscript (Section 1; proofreading and revision). Contributions to Appendix L.

Manuscript (revision and proofreading). Appendix H.

Manuscript (revision and proofreading). Contributions to Appendix L.

Manuscript (contributions to section 1 and 9; proofreading and revision). Contributions to Appendix M.

Manuscript (contributions to section 1, C, 3.3, I; proofreading and revision). Contributions to Appendix D.

Manuscript (sections 2 and 5; contributions to section C). Contributions to Appendix J

Manuscript (rewrote section 3; final version). Initial draft of the Ethics Statement.

Manuscript (sections C and 2; contributions to abstract; revision and proofreading). Contributions to Appendix M.

RWKV performance optimizations (CUDA), Contributions to Appendix 4.

Manuscript (proofreading and revision); Contributions to Figure 6 and Appendix M.

Manuscript (revision and proofreading). Figure 3. Experiments Appendix I. Contributions to Appendices D and M.

Manuscript (proofreading and revision). Contributions to Table 5.

Manuscript (Proofreading and revision of section 3; Add missing citations in 3.3). Revision of Figures 2 and 12.

Manuscript (section C; proofreading and revision). Figures 3 and 6.

Appendix C Additional Related Work

Recently, a number of techniques have been proposed to address the limitations of transformers.

Many transformer variants (“x-formers”) have been introduced to reduce the complexity of transformers Tay et al. (2022), including sparse attention Beltagy et al. (2020); Kitaev et al. (2020); Guo et al. (2022), approximating the full attention matrix Wang et al. (2020); Ma et al. (2021); Choromanski et al. (2020), combining chunked attention with gating Ma et al. (2023) and other efficient methods Katharopoulos et al. (2020); Jaegle et al. (2021).

Some recent works like FlashAttention Dao et al. (2022a) and others Rabe and Staats (2022); Jang et al. (2019) share similarities with RWKV’s chunked computation scheme. Despite being memory-efficient, their time complexity remains quadratic or contains chunk size as a hidden factor. In contrast, RWKV achieves better space and time complexity during inference by formulating a linear attention as an RNN.

Another line of research replaces the attention mechanism with other modules to scale to long sequences. MLP-Mixer and others Tolstikhin et al. (2021); Liu et al. (2021) propose replacing attention by Multi-Layer Perceptrons (MLPs) in computer vision tasks. The Attention Free Transformer (AFT) (Zhai et al., 2021) and HrrFormer (Alam et al., 2023) replace dot-product self-attention with a computationally efficient alternative. None of these models have been successfully scaled to the point where drawing comparisons with transformer-based large language models makes sense.

There has also been substantial research into state space models (SSMs) (Gu et al., 2021) and their variants (Dao et al., 2022b; Gupta et al., 2022; Poli et al., 2023). In contrast to the preceding models, SSMs and their successors have shown substantial progress towards efficient scaling. Simultaneously with this work, Poli et al. (2023) train SSM-based models with 125 million and 355 million parameters and show that the performance is on par with a transformer that uses a mix of local and global attention (Black et al., 2021).

Inspired by the success of transformers, RNN-style Hochreiter and Schmidhuber (1997); Chung et al. (2014) recursive components have also been modified to increase context length, such as the Recurrent Memory Transformer Bulatov et al. (2022, 2023) and Linear Recurrent Units Orvieto et al. (2023). Most similar to our work, the Quasi-Recurrent neural network (QRNN) Bradbury et al. (2017) uses both convolutional layers and recurrent pooling functions across timesteps and channels. While QRNN utilizes convolutional filters with fixed sizes, RWKV employs a time-mixing module as an attention mechanism with time-decaying factors. Different from the element-wise pooling in QRNN, RWKV includes a parametrized channel-mixing module that is parallelizable.

Appendix D Time-Mixing Block as an RNN Cell

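As stated in Section 3.3, the RWKV time-mixing block can be formulated as an RNN, as the $WKV$ computation can be written in the following recursive form:

$$a_0, b_0 = 0, \tag{19}$$
$$wkv_t = \frac{a_{t-1} + e^{u + k_t} \odot v_t}{b_{t-1} + e^{u + k_t}}, \tag{20}$$
$$a_t = e^{-w} \odot a_{t-1} + e^{k_t} \odot v_t, \tag{21}$$
$$b_t = e^{-w} \odot b_{t-1} + e^{k_t}. \tag{22}$$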

The dataflow of the RNN-like time-mixing is shown in Fig. 8, where the hidden state $h$ is the numerator-denominator tuple $(a, b)$. To avoid overflow in calculating $e^{k_t}$, a numerical trick is used in the official implementation. Noticing that $a_1 = e^{k_1} \odot v_1$ and $b_1 = e^{k_1}$, we set $a'_1 = v_1$, $b'_1 = 1$, $p_1 = k_1$, where $p_t$ stores the shared exponent of $a_t$ and $b_t$. The above recursion can then be converted into a numerically safe version, for each time step $t > 1$:

$$q := \max(p_{t-1}, u + k_t), \tag{23}$$
$$wkv_t = \frac{e^{p_{t-1} - q} \odot a'_{t-1} + e^{u + k_t - q} \odot v_t}{e^{p_{t-1} - q} \odot b'_{t-1} + e^{u + k_t - q}}. \tag{24}$$

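The update to $a'_t$, $b'_t$, and their shared exponent is carried out in a similar fashion:

$$q' := \max(p_{t-1} - w, k_t), \tag{25}$$
$$a'_t = e^{p_{t-1} - w - q'} \odot a'_{t-1} + e^{k_t - q'} \odot v_t, \tag{26}$$
$$b'_t = e^{p_{t-1} - w - q'} \odot b'_{t-1} + e^{k_t - q'}, \tag{27}$$
$$p_t = q'. \tag{28}$$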
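The numerically safe recursion described in this appendix can be sketched in plain Python for a single channel (all operations are element-wise, so each channel evolves independently). This is a minimal sketch, not the official implementation: it assumes a per-step decay factor $e^{-w}$ on past tokens and a bonus $u$ on the current token, and the function names `wkv_naive` and `wkv_stable` are illustrative.

```python
import math

def wkv_naive(ks, vs, w, u):
    """Direct recursion on the numerator a and denominator b (one channel)."""
    a, b = 0.0, 0.0
    out = []
    for k, v in zip(ks, vs):
        bonus = math.exp(u + k)                  # weight of the current token
        out.append((a + bonus * v) / (b + bonus))
        a = math.exp(-w) * a + math.exp(k) * v   # decay the past, add current token
        b = math.exp(-w) * b + math.exp(k)
    return out

def wkv_stable(ks, vs, w, u):
    """Same recursion, storing a = exp(p) * a_s and b = exp(p) * b_s.

    Assumes a non-empty sequence. Every exponent passed to math.exp is <= 0,
    so the computation cannot overflow.
    """
    out = [vs[0]]                    # at t = 1 the past is empty, so wkv_1 = v_1
    a_s, b_s, p = vs[0], 1.0, ks[0]  # a'_1 = v_1, b'_1 = 1, p_1 = k_1
    for k, v in zip(ks[1:], vs[1:]):
        # Output: rescale both terms by the larger exponent q.
        q = max(p, u + k)
        e_past, e_cur = math.exp(p - q), math.exp(u + k - q)
        out.append((e_past * a_s + e_cur * v) / (e_past * b_s + e_cur))
        # State update: same trick, using the decayed exponent p - w.
        q2 = max(p - w, k)
        e_past, e_cur = math.exp(p - w - q2), math.exp(k - q2)
        a_s = e_past * a_s + e_cur * v
        b_s = e_past * b_s + e_cur
        p = q2
    return out
```

For moderate keys both functions agree up to rounding. For large keys such as `k = 1000`, `wkv_naive` raises `OverflowError` inside `math.exp`, while `wkv_stable` still returns a finite weighted average of the values.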

The RWKV model has an internal state that stores some previous information. In each layer, the internal state consists of five parts, each of which is a vector with $D$ numbers, where $D$ is the model dimension. The five parts are:

The current input of the Time-mix block $x_t$;

The current input of the Channel-mix block $y_t$;

The numerator of the $WKV$ value $a'_t$, as defined in equation (26);

The denominator of the $WKV$ value $b'_t$, as defined in equation (27);

An auxiliary state $p_t$ in (28), which is used for $WKV$ computation to maintain numerical precision.

This yields a total state size of $5DL$ numbers, where $L$ is the number of layers. It is worth noting that in an algebraic context with infinite precision, the helper state $p_t$ can be ignored, and the $WKV$ numerator and denominator can be computed directly using equations (21) and (22), reducing the size of the internal state to $4DL$.

Appendix E Parameter initializations

We describe the specific parameter initializations below and motivate the design choices. Parameters belonging to residual blocks are often adjusted by layer depth and total number of layers. Let $\#$ denote the vocabulary size, $s$ the embedding dimension, $d$ the hidden size (we use $d = 4s$), $L$ the number of layers, and $l$ the layer index (from 0 to $L-1$). We use the following initializations:

Embeddings are initialized to $\mathcal{U}(\pm 1 \times 10^{-4})$, as explained in Section 3.4.

For the time-mixing blocks (11, 12, 13), initializations are $\mu_{k_i} = (\frac{i}{s})^{1 - \frac{l}{L}}$, $\mu_{v_i} = (\frac{i}{s})^{1 - \frac{l}{L}} + \frac{0.3 l}{L-1}$ and $\mu_{r_i} = \frac{1}{2} \cdot (\frac{i}{s})^{1 - \frac{l}{L}}$.

For the channel-mixing blocks (14, 15), $\mu_{k_i}$ and $\mu_{r_i}$ are initialized to $(\frac{i}{s})^{1 - \frac{l}{L}}$.

$w_i$ (16), also known as “time decay”, is initialized to $-5 + 8 \cdot (\frac{i}{d-1})^{0.7 + \frac{1.3 l}{L-1}}$. Intuitively, it is the discount factor applied to previous tokens over time.

$u_i$ (16), also known as “bonus”, is set to $0.5 \cdot (((i+1) \bmod 3) - 1) + \log 0.3$. It is the special weighting applied to the current token in equation 16. The alternating zigzag pattern initially creates subtle variations in the tensor elements, which are intended to help the model treat different dimensions of the embedding distinctively.

$W_o$ (17) (time-mixing) and $W_v$ (channel-mixing) are initialized to $\mathcal{N}(0, \sqrt{\frac{d}{s}} = 2)$.

All other $W_r$, $W_k$, $W_v$ weights are initialized to 0 so the model can start learning from the beginning without noisy signals.

All LayerNorm weights start from 1 and biases from 0.
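The closed-form initializations above can be evaluated directly. The following is a minimal sketch of the formulas as stated in this appendix, not the official initialization code. It assumes that $i$ is a zero-based channel index, that $\log$ is the natural logarithm, and that $L \geq 2$ (so that $L - 1$ is non-zero); the function names are illustrative.

```python
import math

def time_mix_mu(i, s, l, L):
    """Token-shift mixing coefficients for the time-mixing block of layer l."""
    base = (i / s) ** (1 - l / L)
    return {
        "k": base,
        "v": base + 0.3 * l / (L - 1),
        "r": 0.5 * base,
    }

def channel_mix_mu(i, s, l, L):
    """Token-shift mixing coefficient shared by k and r in channel mixing."""
    return (i / s) ** (1 - l / L)

def time_decay_init(i, d, l, L):
    """Initial value of the time decay w_i; ranges from -5 to 3 over channels."""
    return -5 + 8 * (i / (d - 1)) ** (0.7 + 1.3 * l / (L - 1))

def bonus_init(i):
    """Initial value of the bonus u_i: a period-3 zigzag around log(0.3)."""
    return 0.5 * (((i + 1) % 3) - 1) + math.log(0.3)
```

With these formulas, the first channel of every layer starts with time decay $-5$ and the last channel with $3$, while deeper layers use a larger exponent, which pushes more channels toward the low end of that range. The bonus cycles through offsets $0$, $+0.5$, $-0.5$ around $\log 0.3$.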

Appendix F Small Init Embedding

This section presents the experimental validation of small initialization embedding. The experimental setup is as follows. In the baseline configuration, the parameters are initialized using a normal distribution with a mean of 0.0 and a standard deviation of 0.02, which is a commonly used initialization method in models like BERT and GPT. On the other hand, in the small initialization of the embedding (small init emb) experiment, the parameters are initialized using a uniform distribution with a range of 1e-4, which is slightly different from RWKV where a normal distribution with a standard deviation of 1e-4 is used. However, this difference is negligible and does not affect our conclusions. The experiments were conducted with a batch size of 400. As depicted in Figure 9, the loss curve for the small init emb exhibits a faster rate of decrease and convergence compared to the traditional initialization using a normal distribution.

Appendix G Hyperparameters

To train the models mentioned, we use $\beta = (0.9, 0.99)$ without weight decay for the Adam optimizer, and switch the batch size dynamically between 128 and 256 sequences, each of 1024 tokens. We further organize the training into multiple mini-epochs, each of 40320 samples, to guide our learning rate schedule. The training process takes 8043 mini-epochs to make one pass over the Pile. The initial warm-up mini-epochs have a constant learning rate of “Init LR”. After the warm-up mini-epochs, the learning rate decays exponentially until, in the last mini-epoch, in which the model finishes training on the entire Pile, it reaches the “End LR”. The related hyperparameters are shown in Table 3.
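The schedule described above (constant during warm-up, then exponential decay that lands on the end value in the last mini-epoch) can be sketched as follows. This is an illustrative sketch only: the number of warm-up mini-epochs and the “Init LR” and “End LR” values are given in Table 3 and are not reproduced here, so the arguments used in the usage lines are hypothetical placeholders, and zero-based mini-epoch indexing is an assumption.

```python
def learning_rate(mini_epoch, total, warmup, init_lr, end_lr):
    """Learning rate for a zero-based mini-epoch index.

    Constant at init_lr for the first `warmup` mini-epochs, then geometric
    interpolation so that the last mini-epoch (total - 1) uses end_lr.
    Assumes total > warmup.
    """
    if mini_epoch < warmup:
        return init_lr
    progress = (mini_epoch - warmup + 1) / (total - warmup)
    return init_lr * (end_lr / init_lr) ** progress

def tokens_per_pass(samples_per_mini_epoch=40320, tokens_per_sample=1024,
                    mini_epochs=8043):
    """Tokens seen in one pass, from the figures quoted in this appendix."""
    return samples_per_mini_epoch * tokens_per_sample * mini_epochs

# Hypothetical values, for illustration only.
first = learning_rate(0, total=8043, warmup=100, init_lr=6e-4, end_lr=1e-5)
last = learning_rate(8042, total=8043, warmup=100, init_lr=6e-4, end_lr=1e-5)
```

The figures quoted in the text imply roughly 332 billion tokens per pass (40320 × 1024 × 8043).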

Appendix H Gradient Stability in RWKV

In this section, we present a mathematical description of the gradient stability property in RWKV, focusing specifically on the time-mixing block. By gradient stability we mean that if the inputs $x_t$ are bounded and the model parameters are fixed, then the gradients with respect to $W_k$ and $W_v$ are uniformly bounded for all $T$ (thus not exploding). Consequently, we can control the amount each $x_t$ contributes to the gradient at $T$ in a naturally decaying fashion through the time-decay mechanism $w$ (thus not vanishing unless desired).

First, we make the simplification that there are no token shifts; this will not affect the final conclusion. In this scenario, $wkv_T$ can be written as

The loss function at position $T$ can be written as

Because $wkv_T$ relates to $(W_k)_{i,j}$ and $(W_v)_{i,j}$ only through the $i$-th channel $(wkv_T)_i$, we have

The first part of the above equation contains trivial operations like output layers, and other layers of time-mixing, which can be proven inductively. The second part of the above equation can be bounded as

can also be bounded. Note that $wkv$’s softmax operation contains at least two non-zero terms ($u$ and $w$), so the above “covariance” will not degenerate into 0.
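One way to read the “covariance” mentioned above is through a scalar toy case, sketched below. It assumes that $wkv_T$ is a weighted average of the values $v_t$ with positive weights of the form $\exp(\theta x_t + c_t)$, where $\theta$ stands in for a single entry of $W_k$ and $c_t$ for the time-decay term; this functional form and all names are assumptions made for illustration. Under that assumption, the derivative of the weighted average with respect to $\theta$ equals the weighted covariance of $x_t$ and $v_t$, which is bounded by $2 \max_t |x_t| \cdot \max_t |v_t|$ regardless of the sequence length.

```python
import math

def _weights(theta, xs, cs):
    # Positive weights exp(theta * x_t + c_t); c_t models the time decay.
    return [math.exp(theta * x + c) for x, c in zip(xs, cs)]

def weighted_mean(theta, xs, vs, cs):
    """Weighted average of vs (v_t is treated as independent of theta)."""
    ws = _weights(theta, xs, cs)
    return sum(w * v for w, v in zip(ws, vs)) / sum(ws)

def weighted_cov(theta, xs, vs, cs):
    """Weighted covariance of x_t and v_t under the same weights."""
    ws = _weights(theta, xs, cs)
    z = sum(ws)
    ex = sum(w * x for w, x in zip(ws, xs)) / z
    ev = sum(w * v for w, v in zip(ws, vs)) / z
    exv = sum(w * x * v for w, x, v in zip(ws, xs, vs)) / z
    return exv - ex * ev

# Toy data: four time steps, decay 0.4 per step counted back from the last token.
xs = [0.5, -1.0, 2.0, 0.3]
vs = [1.0, -2.0, 0.5, 3.0]
cs = [-(len(xs) - 1 - t) * 0.4 for t in range(len(xs))]
theta, h = 0.7, 1e-5

# A central finite difference of the weighted mean should match the covariance.
finite_diff = (weighted_mean(theta + h, xs, vs, cs)
               - weighted_mean(theta - h, xs, vs, cs)) / (2 * h)
cov = weighted_cov(theta, xs, vs, cs)
```

The bound follows because each weighted average of bounded quantities is itself bounded by the same constant, so neither term of the covariance can grow with $T$.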

Appendix I Model Behavior Visualization

The right plot illustrates the time decays ($e^{-w}$) in each layer of the RWKV-169M model, sorted along the channel axis. Notably, several decays in the last layers are very close or equal to one, implying that certain information is preserved and propagated throughout the model’s temporal context. Meanwhile, many decays in the initial layer are close to zero, which corresponds to local operations in $wkv$ (16), likely to be associated with tasks such as text parsing or lexical analysis. (Note that the local operations in $wkv$ are due to the extra parameter $u$, when $e^{-w}$ degenerates to 0.) These patterns of time decays are partly learned, but also come from parameter initialization as it speeds up training.

The plot below shows the information retrieval and propagation path in the RWKV-430M model. The experiment follows the causal trace method introduced by Meng et al. (2022), where we:

Run the model once, and record all states and activations of each layer during the computation;

Corrupt the input embeddings of the subject using noise (“The Eiffel Tower” in this example);

Restore the states and activations of a certain layer at a certain token during the computation, and record the log-probability of the model outputting the correct answer (“Paris”).

Unlike transformers, RWKV relies on the recursive propagation of information in the time dimension. In this case, the fact that the Eiffel Tower is located in Paris is retrieved in layer 4 just after the model sees “The Eiffel”. It is then passed down to the subsequent layers. In layer 20, mostly, the information is propagated through time until reaching where it is needed. Finally, at the token “of”, it is passed down to the last layer for outputting the answer.

Appendix J Additional Evaluations

ARC: A dataset designed for multiple-choice question answering, encompassing science exam questions ranging from third grade to ninth grade. It has Easy and Challenge subsets that we report results on separately.

BoolQ: A binary yes/no question answering benchmark.

COPA: A dataset to evaluate achievement in open-domain commonsense causal reasoning.

HeadQA: A benchmark consisting of graduate-level questions encompassing various fields such as medicine, nursing, biology, chemistry, psychology, and pharmacology.

HellaSwag Zellers et al. (2019): A novel benchmark for commonsense Natural Language Inference (NLI) which is built by adversarial filtering against transformer models.

LAMBADA: A benchmark dataset that evaluates the model’s contextual reasoning and language comprehension abilities by presenting context-target pairs, where the objective is to predict the most probable target token. We follow standard practice and use the untokenized version created by OpenAI (Brown et al., 2020).

OpenBookQA: A QA dataset to evaluate human comprehension of a subject by incorporating open book facts, scientific knowledge, and perceptual common sense, drawing inspiration from open book exams.

PIQA: A benchmark for the task of physical common sense reasoning, which consists of a binary choice task that can be better understood as a set of two pairs, namely (Goal, Solution).

ReCoRD: A benchmark for evaluating commonsense reasoning in reading comprehension by generating queries from CNN/Daily Mail news articles and requiring text span answers from corresponding summarizing passages.

SciQ: A multiple-choice QA dataset which was created using an innovative approach to gather well-crafted multiple-choice questions that are focused on a specific domain.

Winogrande: A dataset designed to evaluate the acquisition of common sense reasoning by neural language models, aiming to determine whether we are accurately assessing the true capabilities of machine common sense.

J.2 Evaluation on Long Range Arena

The Long-Range Arena (LRA) benchmark (Tay et al., 2021) is designed to assess the performance of models in handling lengthy context situations. It includes a collection of tasks with sequences ranging from 1,000 to 16,000 tokens, covering various types of data like text, natural language, synthetic images, and mathematical expressions. We apply RWKV on the LRA benchmark and report the results in Table 4. Other models’ performances are directly cited from Gu et al. (2022); Alam et al. (2023).

The results show that RWKV performs second only to the S4 model in five datasets. While RWKV substantially underperforms S4 on Image, Pathfinder, and Path-X, on the problems related to natural language and computer code processing RWKV performs on par with S4 or nearly so.

J.3 Enwik8 Perplexity

We also evaluate our model in terms of perplexity on the Enwik8 dataset. Baseline comparisons are made with Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020) (the best performing dense version), Linear Transformer (Katharopoulos et al., 2020), and Performer (Choromanski et al., 2020). $L$, $d$, and $T$ denote the number of blocks (network depth), dimension of features, and sequence length, respectively. Both Linear Transformer and Performer are implemented with customized CUDA kernels (github.com/idiap/fast-transformers), and all other models are implemented in native PyTorch.

Notes: (1) No weight decay or dropout was used. (2) Trained with AdamW and weight decay set to 0.1, dropout of 0.1, batch size of 16, and initial learning rate of 6e-4.

Appendix K Inference results

Figures 13 and 14 illustrate, respectively, the results on time (s) and memory (RAM, VRAM) requirements for LLM inference in float32 precision. We benchmark the following model families and sizes:

OPT Zhang et al. (2022): 125m, 350m, 1.3b, 2.7b, 6.7b, 13b

GPT-Neo Black et al. (2021): 125m, 1.3b, 2.7b

Pythia Biderman et al. (2023b): 160m, 410m, 1.4b, 2.8b, 6.7b, 12b

Appendix L Importance of prompt construction and comparison to GPT models

Inspired by Kocoń et al. (2023), we compared the zero-shot performance of RWKV-4-Raven-14B with ChatGPT (accessed in February 2023) and GPT-4 using several known NLP tasks, i.e., recognizing textual entailment (RTE), Winograd Natural Language Inference (WNLI), and recognizing emotions elicited in readers (GoEmotions and PolEmo2). Each model got the same prompts, manually chosen to receive proper responses from the ChatGPT model. As shown in Tab. 6, RWKV performs significantly worse than ChatGPT and GPT-4 in several specific tasks. We suspect that this disparity is caused by the choice of prompts used to generate the answers, since the prompts are written in natural language and do not take into account that RWKV, as an RNN, is unable to look back inside an instruction.

When the instruction style was adapted (re-ordered) to respect that RNNs are not capable of "retrospective processing", the quality may significantly change, e.g., for RTE Wang et al. (2019) F1 Macro increased from 44.2% to 74.8%. We hypothesize that RWKV models are more sensitive to the position of the components in the context, as RNN-based architectures cannot look back and readjust the weight of previous information. For better performance, the desired information should be placed after the main question.

An example ChatGPT prompt for recognizing textual entailment (RTE):

Having premise judge if the following hypothesis is logically connected with the premise. Answer "entailment" if yes, or "not_entailment" if no.

A re-ordered RWKV prompt for RTE taking into account the nature of the RNN:

Can you tell me if the hypothesis is entailment or is not entailment to the premise? premise: hypothesis:

While separating the instruction from the input is relatively easy to do, some other aspects of prompt engineering are harder to quantify. For that purpose, we also tested the approach of stating the input after the question on multiple other tasks, i.e., aggression and sarcasm detection, classification of unhealthy (offensive) texts, mathematical Q&A, and sentiment analysis, see Tab. 7. The results suggest that better prompts might reduce the disparity between models. Raven achieves comparable results to ChatGPT on unhealthy conversation detection and even surpasses it on the sarcasm detection dataset. While such an approach to prompting looks necessary, it is not enough in itself to replace the capability of having free access to the whole context. Therefore, prompt engineering seems to be significantly more important for RNN models than for standard transformers. It is entirely possible that good prompts for RNN models do not mean additional restrictions, but should simply be constructed using completely different guidelines.

The authors of the aforementioned paper, Kocoń et al. (2023), perform chain-of-thought to improve results on the MathQA dataset, which is in line with the idea discussed in Wei et al. (2022b). Even including this approach, the Raven model achieved a very low accuracy of 5.43%. Without it, the model performed even worse, performing only very basic and simple calculations and achieving 4.13% accuracy. Raven struggled with questions that required intermediate results. It is likely that the order of information presented in the math questions inside the dataset poses a challenge for the RWKV model. It is yet to be seen if prompt engineering can address this issue. This further emphasizes the importance of the order of information the model receives.
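The re-ordering idea from this appendix (state the question first, then the inputs it refers to) can be sketched as a small prompt builder. This is an illustrative sketch only: the function name is hypothetical, the question wording is copied from the RWKV prompt quoted in this appendix, and the use of newlines between the parts is an assumption.

```python
def rwkv_rte_prompt(premise: str, hypothesis: str) -> str:
    """Build an RTE prompt that places the question before its inputs."""
    question = (
        "Can you tell me if the hypothesis is entailment "
        "or is not entailment to the premise?"
    )
    # The inputs follow the question, so a recurrent model already knows
    # what to look for when it reads them.
    return f"{question}\npremise: {premise}\nhypothesis: {hypothesis}"

prompt = rwkv_rte_prompt("A dog runs in the park.", "An animal is moving.")
```

Because an RNN reads the context once from left to right, this ordering means the model has already seen the task description when it processes the premise and the hypothesis.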

Appendix M Cases

In this part, we present a few instances of outputs produced by the RWKV model using a Chat interface (https://github.com/BlinkDL/ChatRWKV, https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio).