Recurrent Memory Transformer

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Introduction

Transformers (Vaswani et al., 2017) have been widely adopted across multiple domains and tasks (Radford et al., 2018; Dong et al., 2018; Devlin et al., 2019; Dosovitskiy et al., 2021; Ramesh et al., 2021; Jaegle et al., 2021). The key component of a Transformer layer is self-attention. Self-attention updates the representation of each sequence element with information from all other elements in the sequence. As a result, a rich contextual representation of every element is produced at the end of encoding. In this way, global sequence-level information and local information are stored in a single representation. However, mixing the two types of information in a single representation has limitations. Distributing global features across all sequence elements "blurs" them and makes them harder to access. Another well-known deficiency of Transformers is the poor scaling of self-attention with input sequence length, which hurts its application to long inputs (Child et al., 2019; Guo et al., 2019; Dai et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020; Wang et al., 2020; Choromanski et al., 2020).

Our work introduces a memory-augmented segment-level recurrent Transformer named Recurrent Memory Transformer (RMT). RMT uses a memory mechanism based on special memory tokens (Burtsev et al., 2020) added to the input sequence. Memory tokens give the model additional reserved capacity that can be used to process information that does not directly represent any element of the input sequence. To process long sequences, we split them into segments and pass memory states from the previous segment to the current one. This memory passing makes the model recurrent and removes the limit on input sequence length. In theory, RMT can work with sequences of unbounded length, but in practice it is limited by memory capacity and by the efficiency of memory access and update operations. Our implementation of both memory and recurrence in RMT requires no changes to the Transformer model, because modifications are made only to the input and output sequences of the model.

We tested RMT on tasks that require global information about the whole input sequence to be solved. We use copy, reverse, and associative retrieval tasks in a setting where the input sequence is split into segments. Both RMT and Transformer-XL solve these tasks perfectly, but beyond a certain sequence length RMT starts to outperform Transformer-XL. We also show experimentally that the proposed Recurrent Memory Transformer needs a smaller memory to perform close to Transformer-XL on language modeling tasks. RMT code and experiments are available at https://github.com/booydar/LM-RMT. The code, raw experimental results, and hyperparameters are provided in the supplementary materials and on GitHub.

1. In this study we augment the Transformer with token-based memory storage and segment-level recurrence.

2. We experimentally evaluate the proposed architecture, as well as the vanilla Transformer and Transformer-XL, on memory-intensive tasks such as copy, reverse, associative retrieval, and language modeling. We show that RMT outperforms Transformer-XL on sequence processing tasks and is on par with Transformer-XL on language modeling while requiring less memory.

3. We show that the Tr-XL cache can be combined with RMT, leading to better performance on language modeling.

4. We analysed how the Transformer model learns to use memory. Specific interpretable memory read-write patterns of attention are shown.

Related work

In our study we add memory to a general-purpose attention-based neural architecture. Memory is a recurring topic in neural network research. It started with early works (McCulloch and Pitts, 1943; Stephen, 1956) and progressed significantly in the 1990s with the introduction of the Backpropagation Through Time learning algorithm (Werbos, 1990) and the Long Short-Term Memory (LSTM) neural architecture (Hochreiter and Schmidhuber, 1997). Today, memory-augmented neural networks (MANNs) usually rely on some kind of recurrent external memory that is separate from the model’s parameters. Neural Turing Machines (NTMs) (Graves et al., 2014) and Memory Networks (Weston et al., 2014) are equipped with storage for vector representations that can be accessed with an attention mechanism. Memory Networks (Weston et al., 2014; Sukhbaatar et al., 2015) were designed to enable reasoning by sequential attention over the content of a memory. NTMs, followed by the Differentiable Neural Computer (DNC) (Graves et al., 2016) and Sparse DNC (Rae et al., 2016), are implemented as recurrent neural networks able to write to memory storage over time. All these models are differentiable and can be trained via backpropagation through time (BPTT). A parallel line of research extends recurrent neural networks such as LSTM with data structures like stacks, lists, or queues (Joulin and Mikolov, 2015; Grefenstette et al., 2015). MANN architectures with more advanced addressing mechanisms, such as address-content separation and multi-step addressing, were proposed in (Gulcehre et al., 2016, 2017; Meng and Rumshisky, 2018). The Global Context Layer model (Meng and Rumshisky, 2018) uses the idea of address-content separation to solve the difficulty of training content-based addressing in the canonical NTM.

The recent rise of Transformer models has also led to a number of new memory architectures. Transformer-XL (Dai et al., 2019) introduces segment-level recurrence at the level of hidden representations. These representations of a sequence are computed and stored in a cache to be reused as extended context for the next segment. Compressive Transformer (Rae et al., 2019) adds a second layer of memory to Transformer-XL. This memory compresses and stores information from the cache. ∞-former (Martins et al., 2021) uses continuous-space attention and represents the input sequence as a continuous signal to make long-term memory unbounded. The Memory Layers model (Lample et al., 2019) replaces a feed-forward layer within the Transformer block with a product key memory layer to increase model capacity.

Many variations of the Transformer add different sorts of global representations. Among them are Star-Transformer (Guo et al., 2019), Longformer (Beltagy et al., 2020), GMAT (Gupta and Berant, 2020), Extended Transformer Construction (ETC) (Ainslie et al., 2020) and Big Bird (Zaheer et al., 2020). All these architectures redesign the self-attention mechanism to reduce its computational complexity and ensure input coverage with the help of global representations. Memory Transformer (Burtsev et al., 2020) keeps the Transformer model intact and adds memory by extending the input sequence with special memory tokens. Perceiver IO (Jaegle et al., 2021) maps an entire arbitrary input to a fixed number of latent representations. Transformer layers then process only these latent memory representations.

Segment-level recurrence in Transformers is actively explored in a number of studies. Transformer-XL and Compressive Transformer keep previous states and re-use them in subsequent segments. Ernie-Doc (Ding et al., 2021) improves processing by using same-layer recurrence instead of attending to the previous-layer outputs of a preceding segment. Memformer (Wu et al., 2020) introduces a dedicated memory module to keep previous hidden states in summarized representations. Memformer uses two special layers added to the Transformer model: a memory cross-attention layer reads from memory, and a memory slot attention layer updates it. MART (Lei et al., 2020) takes an approach similar to Memformer but uses memory update rules analogous to LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014). FeedBack Transformer (Fan et al., 2020) goes further with full, rather than segment-level, recurrence. FeedBack Memory merges past hidden representations from all layers into a single vector and makes it accessible to the computations at any layer. The disadvantage of full recurrence is that it is less parallelizable: FeedBack Memory requires every sequence element to be processed sequentially. In segment-level recurrent models, all elements of a segment are processed by Transformer layers in parallel, and only segments are processed sequentially. Staircase Transformer (Ju et al., 2021) combines segment-level recurrence and depth recurrence. Staircase models take the outputs for previous segments and pass them as input for the next segment. Our Recurrent Memory Transformer is based on special memory tokens similar to Memory Transformer, segment-level recurrence as in Transformer-XL, and a depth-recurrent mechanism for memory processing similar to Staircase.

Recurrent Memory Transformer

Transformer-XL (Dai et al., 2019) extends the Transformer model with a state re-use cache mechanism for segment-level recurrence and with relative position encoding. The input sequence is split into segments that are processed sequentially. Hidden states computed for the previous segment, $M^{n}$, are cached for each Transformer layer $n$. The input of layer $n$ consists of the last $m$ states from the cached memory and the output of the previous Transformer layer for the current segment $\tau$:

$$\tilde{H}_{\tau}^{n-1} = [SG(M_{-m:}^{n-1}) \circ H_{\tau}^{n-1}], \qquad H_{\tau}^{n} = \mathrm{TransformerLayer}(\tilde{H}_{\tau}^{n-1}).$$

Here, $SG$ stands for stop-gradient and $\circ$ denotes concatenation. Cached states increase the effective context size of the Transformer model and save compute.
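The cached-context construction described above can be illustrated with a small Python sketch. It is not taken from the Transformer-XL code: the function name `extended_context` and the use of plain lists in place of tensors are assumptions made for illustration, and the stop-gradient appears only as a comment because plain lists carry no gradients.

```python
from typing import List, Sequence

Vector = List[float]


def extended_context(
    cache: Sequence[Vector], hidden: Sequence[Vector], m: int
) -> List[Vector]:
    """Concatenate the last m cached states with the current hidden states.

    In a real model the cached states would be detached from the autograd
    graph (stop-gradient) before concatenation.
    """
    if m < 0:
        raise ValueError("m must be non-negative")
    # cache[-0:] would return the whole cache, so m == 0 is handled explicitly.
    kept = list(cache[-m:]) if m > 0 else []
    return kept + list(hidden)


cache = [[0.1], [0.2], [0.3]]  # states cached from the previous segment
hidden = [[1.0], [2.0]]        # current-segment states from the layer below
context = extended_context(cache, hidden, m=2)
```

With `m=0` the function returns only the current hidden states, which corresponds to a model without cached memory.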

In Transformer-XL, self-attention layers are modified to use relative position encodings to improve generalization to longer attention lengths. The overall architecture is shown in Figure 2.

Memory-augmented Transformers such as GMAT, ETC, and Memory Transformer (Gupta and Berant, 2020; Ainslie et al., 2020; Burtsev et al., 2020) proposed using special global tokens as storage for representations. Usually, memory tokens are added to the beginning of the input sequence. However, in decoder-only architectures the causal attention mask makes it impossible for memory tokens at the start of the sequence to collect information from the subsequent tokens. On the other hand, if memory tokens are placed at the end of the sequence, then preceding tokens are unable to access their representations. To solve this problem we add recurrence to the sequence processing. Representations of memory tokens placed at the end of a segment are used as input memory representations at the start as well as at the end of the next segment.

In the Recurrent Memory Transformer, the input is augmented with special [mem] tokens, which are processed in the standard way along with the sequence tokens. Each memory token is a real-valued vector. $m$ memory tokens are added at the beginning of the segment token representations $H_{\tau}^{0}$, and the same $m$ tokens are added at the end:

$$\tilde{H}_{\tau}^{0} = [H_{\tau}^{mem} \circ H_{\tau}^{0} \circ H_{\tau}^{mem}], \qquad \bar{H}_{\tau}^{N} = \mathrm{Transformer}(\tilde{H}_{\tau}^{0}), \qquad [H_{\tau}^{read} \circ H_{\tau}^{N} \circ H_{\tau}^{write}] := \bar{H}_{\tau}^{N}.$$

Here $N$ is the number of Transformer layers.

The starting group of memory tokens functions as a read memory that allows sequence tokens to attend to memory states produced at the previous segment. The ending group works as a write memory that can attend to all current segment tokens and update the representation stored in the memory. As a result, $H_{\tau}^{write}$ contains the updated memory tokens for the segment $\tau$.

Segments of the input sequence are processed sequentially. To enable a recurrent connection between segments, we pass the outputs of the memory tokens from the current segment to the input of the next segment:

$$H_{\tau+1}^{mem} := H_{\tau}^{write}, \qquad \tilde{H}_{\tau+1}^{0} = [H_{\tau+1}^{mem} \circ H_{\tau+1}^{0} \circ H_{\tau+1}^{mem}].$$

Both memory and recurrence in the RMT are based only on global memory tokens. This keeps the backbone Transformer unchanged and makes RMT memory augmentation compatible with any model from the Transformer family. Memory tokens operate only on the input and output of the model. In this study we implement RMT on top of the original Transformer-XL code. Both architectures are shown in Figure 2.
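Because the mechanism touches only the input and output of the backbone, the recurrence can be sketched as a thin wrapper around any sequence-to-sequence function. The following Python sketch is illustrative rather than the authors' implementation: `rmt_forward` and `toy_backbone` are hypothetical names, each token is a single float rather than a vector, and the toy backbone (a causal running sum) merely stands in for a causal Transformer.

```python
from typing import Callable, List, Sequence, Tuple

Backbone = Callable[[List[float]], List[float]]


def rmt_forward(
    backbone: Backbone,
    segments: Sequence[Sequence[float]],
    init_memory: Sequence[float],
) -> Tuple[List[List[float]], List[float]]:
    """Run a backbone over segments, passing write-memory outputs forward."""
    memory = list(init_memory)
    m = len(memory)
    outputs: List[List[float]] = []
    for segment in segments:
        # Input layout: [read memory | segment tokens | write memory].
        augmented = memory + list(segment) + memory
        out = backbone(augmented)
        assert len(out) == len(augmented), "backbone must preserve length"
        # Output layout mirrors the input: [read | sequence | write].
        outputs.append(out[m:len(out) - m])
        # The write block becomes the memory of the next segment.
        memory = out[len(out) - m:]
    return outputs, memory


def toy_backbone(xs: List[float]) -> List[float]:
    """Causal running sum; a toy stand-in for a causal Transformer."""
    total, result = 0.0, []
    for x in xs:
        total += x
        result.append(total)
    return result


seq_outputs, final_memory = rmt_forward(
    toy_backbone, [[1.0, 2.0], [3.0, 4.0]], init_memory=[0.0]
)
```

In this toy run the write token of the first segment ends up holding the sum of that segment, and the second segment reads it back through the read token, so information crosses the segment boundary only through memory.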

Recurrence in the RMT differs from that in Transformer-XL because the former stores only $m$ memory vectors per segment, whereas Transformer-XL stores $m \times N$ vectors per segment. Also, in the RMT model memory representations from the previous segment are processed by Transformer layers together with the current segment tokens. This makes the memory part of RMT effectively deeper: by segment $\tau$, $\tau \times N$ Transformer layers have been applied to it. Additionally, we allow all memory tokens in the read/write block to access all other tokens in the same block. The causal attention mask is applied only to tokens of the input sequence (Figure 6(d)).
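One possible reading of this masking rule is sketched below in Python. The function `rmt_attention_mask` is a hypothetical helper, not part of the released code, and it assumes that attention is causal everywhere except inside a memory block, where every memory token may attend to every other token of the same block.

```python
from typing import List


def rmt_attention_mask(m: int, seg_len: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Layout: m read-memory tokens, seg_len sequence tokens, m write-memory tokens.
    """
    total = 2 * m + seg_len

    def block(pos: int) -> str:
        if pos < m:
            return "read"
        if pos < m + seg_len:
            return "seq"
        return "write"

    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i
            # Full attention inside the read block and inside the write block.
            same_memory_block = block(i) == block(j) and block(i) != "seq"
            row.append(causal or same_memory_block)
        mask.append(row)
    return mask


mask = rmt_attention_mask(m=2, seg_len=3)
```

Under this assumption, read-memory tokens see only the read block, sequence tokens see the read block plus earlier sequence tokens, and write-memory tokens see everything.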

We train the RMT with Backpropagation Through Time (BPTT). During the backward pass, unlike in Transformer-XL, memory gradients are not stopped between segments. The number of previous segments to backpropagate through is a hyperparameter of the training procedure. We vary the BPTT unroll in our experiments from 0 to 4 previous segments. Increasing this parameter is computationally expensive and requires a lot of GPU RAM. However, techniques such as gradient checkpointing could be used to alleviate this problem.

Experiments

We designed our experiments to evaluate the ability of Recurrent Memory Transformers to preserve long-term dependencies across multiple input segments. The first set of experiments includes copy, reverse, associative retrieval, and quadratic equations tasks. The second addresses word-level language modeling on WikiText-103 (Merity et al., 2017) and character-level language modeling on enwik8 (Mahoney, 2006). We compare Recurrent Memory Transformer with Transformer and Transformer-XL models.

Our RMT implementation is based on the Transformer-XL repository (https://github.com/kimiyoung/transformer-xl). The full set of hyperparameters is available in our repository as well as in the supplementary materials. Language modeling experiments follow the same model and training hyperparameters as Transformer-XL. WikiText-103 experiments use 16-layer Transformers (10 heads, 410 hidden size, 2100 intermediate FF); enwik8 experiments use 12-layer Transformers (8 heads, 512 hidden size, 2048 intermediate FF). We used the Adam optimizer (Kingma and Ba, 2015) with a linear learning rate schedule starting from 0.00025, for 200,000 steps on WikiText-103 and 400,000 steps on enwik8. We refer to Transformer-XL with memory size equal to zero as the Baseline. With this experimental setup we were able to reproduce results for the Transformer-XL model close to those of the original paper.

Algorithmic Tasks. We evaluate RMT on algorithmic tasks that require information about the whole input sequence to be solved successfully. In a recurrent setting, the model has to keep information about all previous segments to make predictions.

In the Copy task, an input sequence should be replicated twice after a special start-to-generate token. In the Reverse task, an input sequence should be generated in reverse order. Input for the Associative Retrieval task consists of $N$ key-value pairs. Then one key is randomly selected, and the task is to produce the value associated with the selected key. Another task is to solve quadratic equations. One example consists of an equation, its solution using the discriminant, and an answer. The task is to generate the solution and the answer, while only the quality of the answer is evaluated.

For all tasks, input and output sequences are split into segments and processed by models sequentially. Datasets for algorithmic tasks were randomly pre-generated, the same data was used in all experiments, and character-level tokenization was used. Because Transformer-XL and RMT are decoder-only Transformer models, we don’t compute loss over the input sequence before the start-to-generate token. The loss is computed over target sequence segments only (see Section A.1 for details).
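A hedged Python sketch of how a Copy or Reverse example might be assembled, segmented, and masked for the loss is given below. The token name `<gen>`, the helper names, and the exact placement of the start-to-generate token are assumptions for illustration and may differ from the released data pipeline.

```python
from typing import List, Sequence, Tuple

START = "<gen>"  # hypothetical name for the start-to-generate token


def make_copy_example(source: Sequence[str]) -> Tuple[List[str], int]:
    """Return the full Copy sequence and the index where the target begins."""
    tokens = list(source) + [START] + list(source) * 2
    return tokens, len(source) + 1


def make_reverse_example(source: Sequence[str]) -> Tuple[List[str], int]:
    """Return the full Reverse sequence and the index where the target begins."""
    tokens = list(source) + [START] + list(reversed(source))
    return tokens, len(source) + 1


def split_into_segments(tokens: Sequence[str], segment_length: int) -> List[List[str]]:
    """Cut a sequence into consecutive segments of at most segment_length tokens."""
    return [
        list(tokens[i:i + segment_length])
        for i in range(0, len(tokens), segment_length)
    ]


def loss_mask(n_tokens: int, target_start: int) -> List[bool]:
    """True for target positions, False for the source and the start token."""
    return [i >= target_start for i in range(n_tokens)]


copy_tokens, copy_start = make_copy_example("abc")
segments = split_into_segments(copy_tokens, segment_length=5)
mask = loss_mask(len(copy_tokens), copy_start)
reverse_tokens, _ = make_reverse_example("abc")
```

Here the second segment contains only target tokens, so a model can produce it only if information about the source survives the segment boundary.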

Language Modeling and NLP. We use two standard benchmarks for language modeling: WikiText-103 and enwik8. WikiText-103 (Merity et al., 2017) is used for word-level language modeling and contains 103M words from English Wikipedia articles. Enwik8 (Mahoney, 2006) is used for character-level language modeling and consists of the first $10^{8}$ bytes of an XML text dump of the English Wikipedia. The vocabularies contain 267,735 words for WikiText-103 and 204 characters for enwik8.

We compare Recurrent Memory Transformer with decoder-only Transformer and Transformer-XL as baselines. Model size and training parameters are selected to match the Transformer-XL paper. For WikiText-103 the input context length was set to 150 tokens, and for enwik8 it was set to 512 characters. Another set of experiments inspected how RMT handles long-term dependencies and recurrence. We increased the number of segments and recurrent steps by making segments smaller (50 tokens for WikiText-103, 128 characters for enwik8). The increased number of recurrent steps makes language modeling harder for RMT because information has to be stored in the same amount of memory for more steps.

As a testbed for a real-life application scenario we select the popular long-text classification benchmark Hyperpartisan news (Kiesel et al., 2019). Instead of pre-training RMT from scratch, we add the recurrent memory mechanism to the most widely adopted models from HuggingFace Transformers (Wolf et al., 2020). Specifically, we augment 500 input tokens of already pretrained BERT-base, RoBERTa-base, DeBERTa-base and T5-base with recurrent memory of size 10 and fine-tune on the target task.

Results

Baseline, Transformer-XL (Tr-XL) and RMT perform perfectly in the single-segment setting on the copy and reverse tasks (Figure 3). In this case, the models do not need recurrence because the whole sequence is available. When the number of segments is larger than one, the non-recurrent baseline struggles to solve the tasks, but both memory models demonstrate the ability to retain the required information from previous segments in memory.

On the Copy and Reverse tasks, as the number of segments increases, RMT starts to outperform Transformer-XL whose memory size is smaller than the number of all previous tokens. With up to 6 segments, the mean accuracy of Transformer-XL drops by up to 0.2 points, and with 9 segments it plunges close to that of the baseline without memory. Associative Retrieval results are similar for up to 4 segments: RMT manages to solve the task, with Transformer-XL closely behind. However, in the setting with 5 segments, RMT performance slightly decreases and Transformer-XL average accuracy rises higher.

We analyze how the number of segments, the sequence length, the length of the training context, and the memory size affect the models’ performance on the Copy task (Figure 4). As we split a sequence into more segments, it becomes more crucial to be able to pass information between segments. We split 360 tokens of source + target sequence into multiple segments. In Figure 4(a) we observe that Transformer-XL performance starts to degrade and eventually falls to the baseline model performance as the number of segments increases. In contrast, RMT continues to solve the task perfectly. In a more extreme setting, when we keep the memory size fixed but increase the total length of the sequence to copy, Transformer-XL fails quickly, while RMT starts to degrade gradually only after a length of 720 tokens (Figure 4(b)).

On the Quadratic Equations task (Table 1), we checked that it is possible to solve the task with the Transformer baseline when no segmentation is used. The baseline in this case defines the upper bound for this task. With recurrence over multiple segments, RMT solves the task perfectly, while Transformer-XL finds the task challenging.

The results of experiments on word-level language modeling on WikiText-103 are shown in Table 2. In the first section, with a segment length of 150, Tr-XL and RMT outperform the baseline and Memory Transformer (MemTr) by a large margin. This shows how much language modeling benefits from the increased effective context length provided by the Tr-XL cache or RMT memory. RMT, with its read/write blocks, improves over the MemTr memory mechanism. The best RMT models, with memory sizes 10 and 25, show performance similar to Transformer-XL with a memory size of 75. RMT learns to use a smaller memory more effectively than Transformer-XL. Additionally, the smaller memory size of RMT reduces the GPU memory required to run the model.

To force models to process longer recurrent dependencies, the size of a segment is set to 50, so the number of recurrent steps increases. RMT with memory size 1 shows results similar to Transformer-XL with memory size 10. It is worth noting that Transformer-XL memory consists of hidden representations from all layers (in this case, $10 \times 16$ vectors), whereas RMT memory is only memory_size vectors. Transformer-XL with memory size 50 and RMT with memory size 5 show similar perplexity values (see Section A.5).
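The storage comparison above is simple arithmetic, spelled out here as a short Python sketch. The function names are hypothetical; the numbers are the ones quoted in the text (Transformer-XL memory size 10 with 16 layers, RMT memory size 1).

```python
def trxl_cached_vectors(memory_size: int, n_layers: int) -> int:
    """Transformer-XL caches memory_size hidden states for every layer."""
    return memory_size * n_layers


def rmt_memory_vectors(memory_size: int) -> int:
    """RMT passes only memory_size memory-token vectors between segments."""
    return memory_size


trxl = trxl_cached_vectors(memory_size=10, n_layers=16)
rmt = rmt_memory_vectors(memory_size=1)
ratio = trxl // rmt
```

For these settings the Transformer-XL cache holds 160 vectors against a single RMT memory vector.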

RMT can be combined with the Tr-XL cache. In this case the Tr-XL cache can be seen as short-term memory keeping the nearest context, and RMT memory as long-term memory. Such a combination leads to the best results on WikiText-103, improving over Tr-XL.

On enwik8, RMT models with memory size 5 and Transformer-XL with memory size 40 show similar results, confirming that RMT learns to use smaller amounts of memory representation more effectively. All results for the enwik8 dataset are shown in Section A.4.

Recurrent Memory Transformer learns to make predictions depending on #BPTT_unrolls previous segments plus 1 current segment. Transformer-XL does not use BPTT and relies only on memory_size cached states and the current segment, making in total memory_size + segment_length tokens. In Figure 5(a), we compare RMT and Tr-XL according to this value of visible context at training time.
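Reading the description above as a token count, the visible context at training time can be computed as in the following Python sketch. The function names are hypothetical, and counting the RMT context in tokens as (#BPTT_unrolls + 1) × segment_length is an interpretation of the text, not a formula stated in it; the example values are illustrative.

```python
def rmt_visible_context(segment_length: int, bptt_unrolls: int) -> int:
    """Tokens covered by the current segment plus the unrolled previous segments."""
    return (bptt_unrolls + 1) * segment_length


def trxl_visible_context(segment_length: int, memory_size: int) -> int:
    """Tokens covered by the cached states plus the current segment."""
    return memory_size + segment_length


# Illustrative settings with segment length 50.
rmt_ctx = rmt_visible_context(segment_length=50, bptt_unrolls=1)
trxl_ctx = trxl_visible_context(segment_length=50, memory_size=50)
```

Under this reading, RMT with one BPTT unroll and Transformer-XL with a cache of 50 states both see 100 tokens at training time.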

RMT with a single memory vector can be trained to achieve lower perplexity than Transformer-XL with memory size 10. This means that RMT can learn to compress information from previous observations better. Another observation is that RMT with memory sizes 10 and 25 performs only slightly worse than Transformer-XL, even when Transformer-XL has access to more non-compressed states (50, 100, 200) from previous segments. In general, training RMT with gradients unrolled into earlier segments drastically improves scores, showing the importance of BPTT training. However, we observe instabilities and out-of-memory issues during RMT training for larger memory sizes with deeper BPTT unrolls.

RMT gains a lot when only one memory token is added, but the effect of increasing the memory size from 5 to 50 fades (Figure 5(b)). Still, RMT with memory size 5 performs on par with Transformer-XL with cache size 50, confirming that RMT learns to store more compact representations. The results suggest that there is some optimal memory size for RMT to solve the task, and further increases do not add much.

The proposed recurrent memory mechanism affects only the input and gradient flows of the augmented core model. This might be an important advantage because the memory can be added to an already pretrained model. Evaluation results for four memory-augmented language models fine-tuned for long-text classification are presented in Table 3. Incorporating 10 memory tokens into the input sequence of 512 makes it possible to encode longer stretches of text, up to 2000 tokens, and significantly improves metrics for the majority of models. Moreover, a combination of recurrent memory with RoBERTa-base results in state-of-the-art performance on the Hyperpartisan news classification task (Kiesel et al., 2019). Interestingly, many competing models have an input size of 4096, which is at least twice as long as that of their RMT-extended counterparts, but still lag behind.

To understand the memory operations learned by RMT for algorithmic tasks, we visualise attention maps for the copy and reverse tasks (Figure 6). In each RMT attention map, sequence tokens are preceded by read memory, located at the top left corner, and followed by write memory at the bottom right. The diagonal in the central part of Figure 6(a) (top) shows the classic attention of the token sequence to itself, while the bottom diagonal represents the operation of writing sequence tokens to memory in straight order. When completing the reverse task (Figure 6(a), bottom), the model learns to write the sequence to memory in reversed order, which is in line with common sense.

When it comes to reproducing the target sequence, the model accesses memory (Figure 6(b)) and writes to the output sequence. Another operation (Figure 6(c)) is rewriting from read memory to write memory. It is commonly used by RMT in settings with a larger number of segments to keep information about recent segments longer.

The Transformer-XL mechanism of accessing memory (Figure 6(d)) does not allow straightforward writing to memory without changing sequence token representations. Sequential reading from the cache is represented by diagonals on Transformer-XL attention maps. Using token representations as storage harms model performance in tasks with a larger number of segments. For the reverse task with 4 segments, Transformer-XL with limited memory size 6 (Appendix B, Figure 9(b)) attempts to mix representations of tokens and read multiple symbols from one cached state in the next segments, giving an average accuracy of 0.8 on the target task. Despite having the same memory size, RMT manages to compress the whole segment into memory tokens (Appendix B, Figure 9(a)) and achieves a mean accuracy of 1.

Visualizations from Figure 6 and Appendix B, Figure 9 provide evidence to support our hypothesis that Tr-XL has to mix representations from previous and current segments in the same hidden states to pass information between segments. The visualizations also show how memory tokens in RMT help mitigate this kind of mixing. The ability of RMT to compress a sequence into memory is illustrated in Section A.1, Figure 8. For copy with 6 segments, RMT compresses and then reads a sequence of 12 tokens with just 6 memory tokens. For Transformer-XL, decreasing the memory size harms accuracy significantly when the number of segments is larger than 2.

Conclusions

In this paper we introduced the Recurrent Memory Transformer, a simple recurrent memory augmentation of the Transformer model. RMT is implemented by extending the input sequence with special global memory tokens and by segment-level recurrence. Importantly, our method makes it possible to learn more compact sequence representations and to improve existing pretrained models without extensive additional compute, thus making practical machine learning applications more energy efficient and environmentally friendly.

In our experiments we compared RMT with Transformer baseline and Transformer-XL which is a well-known modification of Transformer for long sequences. RMT almost perfectly solves Copy, Reverse as well as quadratic equations tasks for sequences consisting of multiple segments outperforming Transformer-XL. It also demonstrates quality for associative retrieval task on par with Transformer-XL. As expected, baseline Transformer fails to solve these tasks for multi-segment settings.

RMT trained as a language model performs significantly better than the Transformer baseline and shows quality metrics similar to Transformer-XL with up to 10 times smaller memory. Experimental results demonstrate that, for a fixed memory size, backpropagating gradients through more segments improves the performance of RMT. The proposed approach to memory augmentation is quite universal and might be easily applied to any pretrained Transformer-based model, as demonstrated by the state-of-the-art results achieved on the long-text classification task by fine-tuning a combination of RoBERTa and RMT.

Analysis of attention maps suggests that better RMT performance can be related to more effective storage of input representations in dedicated memory tokens compared to mixing representations storage in Transformer-XL. RMT could be combined with Transformer-XL cache and improve the performance of both models.

Overall, results of the study show that dedicated memory storage and recurrence provided by Recurrent Memory Transformer make it a promising architecture for applications that require learning of long-term dependencies and general purpose in-memory processing, such as algorithmic tasks and reasoning. Furthermore, we believe that RMT could open the way for adding memory and recurrence to other models in the Transformer family.

Acknowledgments and Disclosure of Funding

This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.

References

Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes] We mention training instabilities and GPU RAM issues in Section 5.

Did you discuss any potential negative societal impacts of your work? [No] The proposed model and method do not have any specific impacts. All general negative societal impacts applicable to the field could potentially be relevant.

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [N/A]

Did you include complete proofs of all theoretical results? [N/A]

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We include code, training scripts, and raw experimental data in the supplementary material. The supplemental materials will be published on GitHub with the final version of the paper. Instructions for language modeling data and experiments are taken from the Tr-XL repo.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4, Appendix A, and provided supplementary material.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] All the key experiments results are reported with std. Furthermore, we provide raw experimental data in the supplementary materials.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We used different GPUs depending on the task: 1080Ti, V100, A100. We provide this information in Appendix A for each task.

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [Yes] We refer to the original Tr-XL code and Tr-XL paper. We use them for establishing baselines and setting up our methods. See Section 4.

Did you mention the license of the assets? [No] Tr-XL license is Apache 2.0 and available at its github repo.

Did you include any new assets either in the supplemental material or as a URL? [Yes] Our code is in the supplemental material and on GitHub: https://github.com/booydar/LM-RMT

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [No] We used publicly available Tr-XL code (Apache 2.0) and datasets.

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] We use either synthetic data or datasets collected from the Wikipedia (Wikitext-103, enwik8).

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Training details and additional results

Datasets were randomly generated by uniformly sampling tokens from the dictionary into task sequences and generating targets according to each task. After generation, the datasets are fixed for all experiments.

Copy and reverse use sequences of sizes 24, 40, 120, 240, and 360, making total copy/reverse input length 48/72, 80/120, 240/360, 480/720, 720/1080. The associative retrieval task consists of 4 key-value pairs and one randomly selected key; the answer consists of one value. Train, validation and test sizes of copy 24, reverse 24 and associative retrieval datasets are 100000, 5000 and 10000.

Transformer-XL had the same cache size on training and validation to match RMT.

For training all models on copy and reverse, we used a learning rate of 1e-4 with reduction on plateau and a decay factor of 0.5. Copy and reverse were solved by models with 4 layers and 4 heads; associative retrieval models had 6 layers and 4 heads. Models with the same context size and memory size were trained for the same number of steps and with the same training parameters.

Experiments with sequence length 24 were conducted on a single Nvidia GTX 1080 Ti GPU, with training time ranging from 1 hour to 2-3 days. Copy and reverse experiments on longer sequences were run on more powerful Tesla V100 GPUs using 1-3 devices, with training time varying from 1 hour to 3-4 days.
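The reduce-on-plateau rule mentioned above can be illustrated with a minimal pure-Python sketch. The initial rate 1e-4 and the factor 0.5 come from the text; the `patience` value and the class name `ReduceOnPlateau` are assumptions made for illustration, since the text does not state them. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` implements the same idea, but the text does not say whether that scheduler was used.

```python
class ReduceOnPlateau:
    """Halve the learning rate when validation loss stops improving."""

    def __init__(self, lr=1e-4, factor=0.5, patience=2):
        self.lr = lr                # initial rate from the text
        self.factor = factor        # decay factor from the text
        self.patience = patience    # assumed value, not given in the text
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation result and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals > self.patience:
                self.lr *= self.factor
                self.bad_evals = 0
        return self.lr


scheduler = ReduceOnPlateau()
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
```

In this run the loss improves twice and then stalls for three evaluations, so the rate is reduced only on the last step.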

A.2 Associative retrieval

We used the task dataset generation code from Ba et al. (2016): https://github.com/GokuMohandas/fast-weights/blob/539fb10e3c384d5f782af2560bf28631cd0eaa61/fw/data_utils.py.
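As a rough illustration of the task structure (4 key-value pairs followed by one randomly selected key, with a single value as the answer), a sample could be generated as in the sketch below. This is not the referenced script: the function name `make_retrieval_sample`, the choice of letters for keys and digits for values, and the `??` separator are assumptions made for this example.

```python
import random
import string


def make_retrieval_sample(rng, num_pairs=4):
    """Build one sample: key-value pairs, a query key, and the value to retrieve."""
    # Assumption: keys are distinct lowercase letters, values are single digits.
    keys = rng.sample(string.ascii_lowercase, num_pairs)
    values = [rng.choice(string.digits) for _ in keys]
    query = rng.choice(keys)
    # Assumption: pairs are concatenated and "??" separates them from the query.
    source = "".join(k + v for k, v in zip(keys, values)) + "??" + query
    target = values[keys.index(query)]
    return source, target


rng = random.Random(0)
source, target = make_retrieval_sample(rng)
```

Fixing the seed mirrors the statement that datasets are generated once and then kept fixed for all experiments.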

A.3 Quadratic equations

This dataset consists of equations with integer coefficients and step-by-step solutions using the discriminant. Equation generation starts by uniformly sampling real roots $x_1, x_2$ from -100 to 100. The answer of an equation is represented as $x_1, x_2$. Next, we form the equation as a product of two factors, $(x - x_1)(x - x_2) = 0$, which expands to $x^2 - (x_1 + x_2)x + x_1 x_2 = 0$. We then multiply all coefficients by a random natural number $\alpha$ from 1 to 10. The final equation form is $\alpha x^2 - \alpha(x_1 + x_2)x + \alpha x_1 x_2 = 0$. A dataset sample is made of these stages in reversed order. We also provide a string with the discriminant calculation to help find the equation roots. 20 percent of equations in the dataset do not have real roots.

x^2-98*x+552=0;D=98^2-4*1*552=7396=86^2;x=(98-86)/2=6;x=(98+86)/2=92

Each solution step is tokenized on char level and padded to the length of 30 tokens. The total length of each training sample is 180, the dataset has 100000 training, 10000 validation and 20000 test samples.
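The generation procedure for equations with real roots can be sketched as follows. This is an illustration under stated assumptions, not the authors' generator: the function names are hypothetical, roots are assumed to be integers (consistent with integer coefficients), the step strings use a simplified layout with parenthesized signed coefficients, and neither the 20 percent of equations without real roots nor the padding to 30 tokens is covered.

```python
import math
import random


def make_quadratic(rng, low=-100, high=100, max_alpha=10):
    """Sample roots x1 <= x2 and build coefficients of alpha*(x-x1)*(x-x2)=0."""
    x1, x2 = sorted((rng.randint(low, high), rng.randint(low, high)))
    alpha = rng.randint(1, max_alpha)
    a, b, c = alpha, -alpha * (x1 + x2), alpha * x1 * x2
    return (a, b, c), (x1, x2)


def solve_steps(a, b, c):
    """Solve via the discriminant; return illustrative step strings and the roots."""
    disc = b * b - 4 * a * c      # equals (alpha * (x1 - x2))^2, a perfect square
    sqrt_disc = math.isqrt(disc)
    r1 = (-b - sqrt_disc) // (2 * a)
    r2 = (-b + sqrt_disc) // (2 * a)
    steps = [
        f"{a}*x^2+({b})*x+({c})=0",
        f"D=({b})^2-4*{a}*({c})={disc}={sqrt_disc}^2",
        f"x=({-b}-{sqrt_disc})/{2 * a}={r1}",
        f"x=({-b}+{sqrt_disc})/{2 * a}={r2}",
    ]
    return steps, (r1, r2)


rng = random.Random(0)
(a, b, c), roots = make_quadratic(rng)
steps, solved = solve_steps(a, b, c)
```

Calling `solve_steps(1, -98, 552)` reproduces the numbers in the example string above: discriminant 7396, square root 86, and roots 6 and 92.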

For this task we used models with 6 layers, 6 heads and segment sizes 180 and 30. The training was performed with the same schedule as copy and reverse on a single GTX 1080 Ti for 1-2 days. Memory size for RMT and Transformer-XL was chosen equal to the segment length.

A.4 Enwik8

We verified our experimental setup by reproducing Transformer-XL results on the enwik8 dataset (Table 4). We used 12-layer Baseline (Transformer), Transformer-XL, and RMT models in all enwik8 experiments. All results on the enwik8 dataset are in Table 4. We used 2 NVIDIA A100 80GB GPUs; training time varied from 10 to 30 hours depending on sequence length, memory size, and number of BPTT unrolls.

A.5 WikiText-103

We used 16-layer models in all experiments on the WikiText-103 dataset. Training hyperparameters were taken from (Dai et al., 2019) and the authors' PyTorch scripts (https://github.com/kimiyoung/transformer-xl). All results on the WikiText-103 dataset are in Table 5. In most of the WikiText-103 experiments, we used 2 NVIDIA A100 80GB GPUs; training time varied from 10 to 30 hours depending on sequence length, memory size, and number of BPTT unrolls. All models except the ones noted with 2x steps were trained for 200k batches. Transformer-XL did not benefit from longer training, unlike the Tr-XL + RMT model. For training the combined model, we used an auxiliary loss for memory tokens, which was added to the main loss with a multiplier of 0.01. We set a new fixed special token to be predicted from memory as the target in the auxiliary loss.
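The way the auxiliary term enters the objective can be sketched in a few lines. Only the multiplier 0.01 and the idea of predicting one fixed special token from memory come from the text; the function names, the use of cross-entropy, and the averaging over memory positions are assumptions made for this illustration.

```python
import math

AUX_WEIGHT = 0.01  # multiplier stated in the text


def cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_norm - logits[target_index]


def combined_loss(main_loss, memory_logits, special_token_index):
    """Main loss plus the weighted auxiliary loss over memory positions."""
    # Assumption: the auxiliary loss is averaged over all memory positions.
    aux = sum(cross_entropy(row, special_token_index) for row in memory_logits)
    aux /= len(memory_logits)
    return main_loss + AUX_WEIGHT * aux


# Toy example: two memory positions, vocabulary of two tokens, uniform logits.
total = combined_loss(1.0, [[0.0, 0.0], [0.0, 0.0]], special_token_index=0)
```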

Appendix B Operations with Memory