Toeplitz Neural Network for Sequence Modeling

Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, Yiran Zhong

Introduction

Sequence modeling is a fundamental problem in natural language processing, speech processing, and computer vision. Various sequence modeling methods have been proposed in the literature, including recurrent architectures (Hochreiter & Schmidhuber, 1997), convolutional architectures (LeCun et al., 1989), and transformers (Vaswani et al., 2017). These models exploit different properties of sequential data. For example, recurrent models (Hochreiter & Schmidhuber, 1997) mimic the sequential property by processing the input step by step while maintaining hidden states. Convolutional models (LeCun et al., 1989) enforce a locality bias and only let elements within a local patch interact. Transformers use attention matrices to model pairwise relations between tokens regardless of the distance between them. Recently, Transformers (Vaswani et al., 2017; Dosovitskiy et al., 2021) have shown strong performance on a wide range of applications across domains and have become arguably one of the most successful architectures for sequence modeling in general.

There are two main components in transformers: the attention mechanism that learns pairwise correlations of tokens from data, and the position embedding to introduce positional inductive biases. The vanilla attention mechanism requires quadratic space-time complexity, which precludes Transformers from handling long sequences. Numerous attention variants have been proposed recently to reduce the complexity, including linear transformers (Katharopoulos et al., 2020), and Performer (Choromanski et al., 2021). Although the types of attention vary, the position embedding remains in every method, which indicates the importance of position information in sequence modeling. This motivates us to ask the following question: since position information is important, can we design a model that relies entirely on the position information of its elements regardless of their content, thus alleviating the quadratic computation cost of the vanilla attention mechanism?

In this paper, we give an affirmative answer to this question by introducing the Toeplitz neural network (TNN), a new efficient architecture that solely exploits relative position relations for sequence modeling. Specifically, instead of attention matrices, the Toeplitz neural network uses Toeplitz matrices to capture relations between each token pair. There are two motivations for selecting the Toeplitz matrix. One is that it compactly represents relative positional relations between tokens with far fewer parameters, i.e., $2n-1$ parameters for an $n\times n$ Toeplitz matrix. The other is that the Toeplitz matrix-vector product can be computed in $O(n\log n)$ time, which is exactly what we use in our token mixing operation. In this way, we avoid computing content similarities between tokens and reduce the quadratic computation complexity of transformers to log-linear, yielding a more efficient sequence modeling architecture.

We further propose a relative position encoder, a lightweight module that generates the relative position parameters used to assemble the Toeplitz matrices, so that the number of the TNN's parameters no longer depends on the sequence length. Moreover, it allows the TNN to deal with varying sequence lengths without retraining. In addition, input sequence length extrapolation has become an important ability in sequence modeling, as training on longer sequences can be prohibitively expensive (Press et al., 2022). We propose an exponential decay bias that is applied directly to the Toeplitz matrix. Trained on sequences of 512 tokens, our model maintains consistent performance at inference on sequences of up to 14K tokens. We also show analytically that the Toeplitz neural network represents a general form of sequence modeling methods, with transformers, CNNs, and the recently proposed State-space-based methods (Gu et al., 2022) as its special forms.

We validate our model on a wide range of sequence modeling tasks and benchmarks. These include autoregressive language modeling, text classification, image classification, and the Long-Range Arena benchmark. As illustrated in Fig. 1, our model achieves state-of-the-art performance on most tasks at a favorable log-linear space-time complexity. It also demonstrates superior extrapolation capabilities when trained on shorter sequences and evaluated on longer ones off the shelf.

Preliminary

In this section, we introduce concepts used throughout the paper, including positional embedding, token and channel mixing, and the Toeplitz matrix. Notations used can be found in Appendix A.

Positional embedding is introduced in transformers (Vaswani et al., 2017) to inject positional inductive bias. It often uses fixed or learned parameters to encode position-specific information, thus making the model position-aware. There are mainly two types of positional embeddings: the absolute positional embedding (Vaswani et al., 2017) and the relative positional embedding (Shaw et al., 2018). In this work, we focus on the relative positional embedding to emphasize pairwise token relations. A typical relative positional embedding (Raffel et al., 2020) is formulated as:

$$e_{ij}=\mathbf{q}_{i}^{\top}\mathbf{k}_{j}/\sqrt{d}+w_{i-j},$$

where $i,j$ are two positional indices, $e_{ij}$ denotes the attention score before softmax, $\mathbf{q}_{i}$ and $\mathbf{k}_{j}$ are the query and key vectors in the attention, $d$ is their dimension, and $w_{i-j}$ is a positional coefficient. In this case, the relative position information is added to the attention as a bias.
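As a minimal sketch of this bias (not the authors' code), the snippet below computes pre-softmax scores as a content term plus a coefficient that depends only on the offset $i-j$. The usual $1/\sqrt{d}$ scaling is assumed, and the function name `attention_scores`, the toy vectors, and the bias table `w` are hypothetical.

```python
import math

def attention_scores(q, k, w):
    """Return e[i][j] = q_i . k_j / sqrt(d) + w[i - j] (scores before softmax).

    q, k: lists of n vectors of dimension d.
    w: dict mapping a relative offset (i - j) to a scalar bias.
    """
    n, d = len(q), len(q[0])
    scores = []
    for i in range(n):
        row = []
        for j in range(n):
            content = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            row.append(content + w[i - j])  # bias depends only on i - j
        scores.append(row)
    return scores

# Toy inputs (hypothetical values): n = 3 tokens, d = 4 channels.
q = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
k = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
w = {offset: -0.5 * abs(offset) for offset in range(-2, 3)}

e = attention_scores(q, k, w)
```

Because the bias is indexed by $i-j$ alone, the bias part of the score matrix has constant diagonals.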

Researchers often classify various sequence modeling techniques based on the token mixing techniques used. MLP-based methods (Liu et al., 2021; Tolstikhin et al., 2021) use matrix multiplication on the sequence dimension for token mixing. FFT-based methods (Lee-Thorp et al., 2022) utilize the FFT on the sequence dimension to mix token-wise information. The State-space-based methods (Gu et al., 2022) leverage the state equations and hidden states to model sequences, as well as perform interactions between tokens.

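A Toeplitz matrix is a special form of matrix that has constant values along each diagonal running from top-left to bottom-right, i.e.,

$$\mathbf{T}_{ij}=\mathbf{T}_{i+1,j+1}=t_{i-j},\qquad \mathbf{T}=\begin{bmatrix}t_{0}&t_{-1}&\cdots&t_{-n+1}\\ t_{1}&t_{0}&\ddots&\vdots\\ \vdots&\ddots&t_{0}&t_{-1}\\ t_{n-1}&\cdots&t_{1}&t_{0}\end{bmatrix}\in\mathbb{R}^{n\times n}.$$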

There are two nice properties of a Toeplitz matrix: 1) an $n\times n$ Toeplitz matrix can be described compactly with $2n-1$ parameters; 2) the Toeplitz matrix-vector product is faster than the standard matrix-vector product. In particular, we have:

Theorem 2.1. For a Toeplitz matrix $\mathbf{T}\in\mathbb{R}^{n\times n}$ and any vector $\mathbf{x}\in\mathbb{R}^{n}$, the time complexity of $\mathbf{T}\mathbf{x}$ is $O(n\log n)$.

We provide a detailed proof in Appendix B. This property enables us to use Toeplitz matrices to perform efficient token mixing.
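The fast product can be sketched as follows, assuming the standard construction that embeds the Toeplitz matrix in a larger circulant matrix and multiplies via the FFT. This is an illustrative pure-Python version, not the authors' implementation: the helper names are hypothetical, the coefficients are stored in a dict keyed by the offset $i-j$, and the circulant size is rounded up to a power of two so that a simple radix-2 FFT applies. In practice one would call an FFT library instead of the hand-written `fft`.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unnormalized)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def toeplitz_matvec_naive(t, x):
    """O(n^2) reference: y_i = sum_j t[i - j] * x_j."""
    n = len(x)
    return [sum(t[i - j] * x[j] for j in range(n)) for i in range(n)]

def toeplitz_matvec_fft(t, x):
    """O(n log n) product via a circulant embedding of size m >= 2n."""
    n = len(x)
    m = 1
    while m < 2 * n:
        m *= 2
    # First column of the circulant: c[k] = t[k] for k >= 0, c[m - k] = t[-k].
    c = [0.0] * m
    for k in range(n):
        c[k] = t[k]
    for k in range(1, n):
        c[m - k] = t[-k]
    x_pad = list(x) + [0.0] * (m - n)
    fc, fx = fft(c), fft(x_pad)
    prod = fft([a * b for a, b in zip(fc, fx)], invert=True)
    # Normalize the inverse transform and keep the first n entries.
    return [(v / m).real for v in prod[:n]]

# Toy example: n = 5, so the matrix is described by 2n - 1 = 9 coefficients.
t = {k: 1.0 / (1 + abs(k)) + (0.1 if k < 0 else 0.0) for k in range(-4, 5)}
x = [1.0, -2.0, 3.0, 0.5, 4.0]
y_fast = toeplitz_matvec_fft(t, x)
y_ref = toeplitz_matvec_naive(t, x)
```

Both routines return the same vector up to floating-point error; only the second avoids materializing the $n\times n$ matrix.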

Toeplitz neural network

In this section, we present the design and analysis of the proposed Toeplitz Neural Network (TNN). We first give an overview of the model and then describe each of its components. We also discuss the connection between the TNN and other sequence modeling methods at the end of this section.

Our model consists of a stack of Gated Toeplitz Units (GTU) and GLU (Shazeer, 2020). GTU is a modified GLU layer injected with the proposed Toeplitz Neural Operator (TNO), as illustrated in Fig. 2. A TNO is used to perform token mixing with a Toeplitz matrix. To generate relative position coefficients for the Toeplitz matrix, we propose a Relative Position Encoder (RPE), a lightweight fully-connected sub-network to encode the relative position information. An exponential decay bias is also added to the Toeplitz matrix to enable extrapolation on longer inputs.

Toeplitz neural operator

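Let us define a token mixing operation as:

$$\mathbf{y}=\mathbf{T}\mathbf{x}\in\mathbb{R}^{n},$$

where $\mathbf{T}\in\mathbb{R}^{n\times n}$ is a Toeplitz matrix, $\mathbf{x}\in\mathbb{R}^{n}$ is the input, and $\mathbf{y}$ is the token mixing result. For a $d$-dimensional sequence, the token mixing is performed on each dimension individually.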
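To make the per-dimension mixing concrete, the sketch below applies a separate set of Toeplitz coefficients to each of the $d$ channels of an $n\times d$ sequence. It uses the naive quadratic sum for readability rather than the fast product, and the name `token_mix` and the toy coefficients are hypothetical.

```python
def token_mix(coeffs, X):
    """Mix tokens independently per channel.

    coeffs[c] maps an offset (i - j) in [-(n-1), n-1] to t_{i-j} for channel c.
    X is an n x d sequence given as a list of n rows.
    Returns Y with Y[i][c] = sum_j coeffs[c][i - j] * X[j][c].
    """
    n, d = len(X), len(X[0])
    Y = [[0.0] * d for _ in range(n)]
    for c in range(d):
        for i in range(n):
            Y[i][c] = sum(coeffs[c][i - j] * X[j][c] for j in range(n))
    return Y

# Toy example with n = 3 tokens and d = 2 channels.
# Channel 0 uses the identity (t_0 = 1); channel 1 shifts tokens by one (t_1 = 1).
coeffs = [
    {k: 1.0 if k == 0 else 0.0 for k in range(-2, 3)},
    {k: 1.0 if k == 1 else 0.0 for k in range(-2, 3)},
]
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
Y = token_mix(coeffs, X)
# Y == [[1.0, 0.0], [2.0, 10.0], [3.0, 20.0]]
```

Channels never interact inside this operation; mixing across channels is left to other layers of the model.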

As stated in Theorem 2.1, the computation complexity of Eq. 4 is $O(n\log n)$. As we need to perform token mixing on $d$ dimensions, our TNO has a computation complexity of $O(nd\log n)$. The next question is how to calculate the relative position coefficients in $\mathbf{T}$. A naive solution is to make the coefficients learnable parameters, such that the model can directly learn them from training data. However, this solution has two drawbacks: 1) Parameter explosion. For a $d$-dimensional sequence of $n$ tokens, there are a total of $(2n-1)d$ learnable parameters, which can be prohibitively large as $n$ increases. It also shows unsatisfactory performance in our ablation studies in Sec. 4.3. 2) Fixed input sequence length. Since the sequence length $n$ is fixed in training, we are unable to adjust the sequence length during inference, i.e., performance drops severely when the sequence length changes. To address these drawbacks, we propose a relative position encoder to generate the relative position coefficients.

Relative position encoder

Note that recent literature (Mildenhall et al., 2021) claims that projecting the scalar input to a higher-dimensional space with high-frequency functions, i.e., $\sin$ and $\cos$ functions, before passing it to a network can lead to better performance. However, in our ablations, we find that using the original integer achieves better performance.
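The idea of generating coefficients from integer offsets can be sketched with a tiny fully-connected network that maps each offset in $-(n-1),\ldots,(n-1)$ to $d$ coefficients. Everything concrete here is an assumption made for illustration: the two-layer shape, the ReLU activation, the hidden width, the random initialization, and all function names. The point is only that the parameter count is independent of $n$.

```python
import random

def make_rpe(hidden, d, seed=0):
    """Build a tiny 2-layer MLP: scalar offset -> hidden -> d coefficients.

    Layer sizes, ReLU, and random init are illustrative assumptions.
    """
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "w2": [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)],
        "b2": [rng.uniform(-1, 1) for _ in range(d)],
    }

def rpe_forward(params, offset):
    """Map one scalar offset to a list of d coefficients."""
    h = [max(0.0, w * offset + b) for w, b in zip(params["w1"], params["b1"])]
    return [sum(w * v for w, v in zip(row, h)) + b
            for row, b in zip(params["w2"], params["b2"])]

def toeplitz_coefficients(params, n):
    """Return {offset: [t for each of d channels]} for offsets -(n-1)..(n-1)."""
    return {k: rpe_forward(params, float(k)) for k in range(-(n - 1), n)}

def num_params(params):
    """Count the scalar parameters of the MLP."""
    return (len(params["w1"]) + len(params["b1"]) + len(params["b2"])
            + sum(len(row) for row in params["w2"]))

params = make_rpe(hidden=8, d=4)
coeffs_short = toeplitz_coefficients(params, 16)  # 2 * 16 - 1 = 31 offsets
coeffs_long = toeplitz_coefficients(params, 64)   # 2 * 64 - 1 = 127 offsets
```

The same `params` serve both lengths, whereas directly learned coefficients would need $(2n-1)d$ parameters, i.e., 508 for $n=64$, $d=4$ against 52 here.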

Exponential decay bias. Previous models (Vaswani et al., 2017; Qin et al., 2022) often use a fixed sequence length in both training and inference. To run inference on a longer sequence, the model needs to be retrained on the longer sequence length to maintain performance, which can be prohibitively expensive in practice.

ALiBi (Press et al., 2022) shows that by applying a simple penalty to the query-key attention scores, the Transformer can handle longer sequence lengths at inference without compromising performance. The penalty is a linear bias that is proportional to the distance between tokens. Inspired by this technique, we propose an exponential decay bias that is applied directly to the Toeplitz matrix to achieve the same goal. Specifically, let us define a decay rate $\lambda\in[0,1]$; the new relative position coefficients $\bar{t}_{i-j}$ in $\mathbf{T}$ can be expressed as:

$$\bar{t}_{i-j}=\lambda^{|i-j|}t_{i-j}.$$
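A minimal sketch of such a bias, assuming it scales the coefficient at offset $i-j$ by $\lambda^{|i-j|}$; the function name `apply_decay` and the all-ones coefficients are hypothetical.

```python
def apply_decay(t, lam):
    """Scale each coefficient t[k] (k = i - j) by lam ** |k|."""
    return {k: (lam ** abs(k)) * v for k, v in t.items()}

# Toy coefficients: all ones, offsets -3..3.
t = {k: 1.0 for k in range(-3, 4)}
decayed = apply_decay(t, 0.5)
# decayed[0] == 1.0, decayed[1] == 0.5, decayed[-2] == 0.25, decayed[3] == 0.125

# With lam = 0.99, a token 512 positions away is scaled by 0.99 ** 512,
# which is well under 1% of its undecayed weight.
far = 0.99 ** 512
```

Distant tokens therefore contribute progressively less, which is the property that makes evaluation on offsets unseen during training less disruptive.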

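ALiBi can be seen as a special case of our method. Given the equation of ALiBi:

$$\bar{s}_{ij}=\mathbf{q}_{i}^{\top}\mathbf{k}_{j}/\sqrt{d}+m|i-j|,\quad m<0,$$

we have

$$\exp(\bar{s}_{ij})=\exp(\mathbf{q}_{i}^{\top}\mathbf{k}_{j}/\sqrt{d})\exp(m|i-j|)=\exp(\mathbf{q}_{i}^{\top}\mathbf{k}_{j}/\sqrt{d})\lambda^{|i-j|},\quad\lambda=\exp(m).$$

This means that ALiBi applies an exponential decay to the softmax attention matrices, whereas ours applies it to the Toeplitz matrices.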
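This equivalence can be checked numerically. The sketch below assumes ALiBi adds a linear penalty $m|i-j|$ with slope $m<0$ to the scores before softmax, and verifies that the resulting attention weights equal those obtained by multiplying the exponentiated scores by $\lambda^{|i-j|}$ with $\lambda=\exp(m)$. The scores, the slope, and the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def alibi_row(scores, i, m):
    """Softmax over scores[j] + m * |i - j| (linear bias, m < 0)."""
    return softmax([s + m * abs(i - j) for j, s in enumerate(scores)])

def decay_row(scores, i, lam):
    """Same weights written as exp(score) * lam ** |i - j|, then normalized."""
    weights = [math.exp(s) * lam ** abs(i - j) for j, s in enumerate(scores)]
    total = sum(weights)
    return [v / total for v in weights]

scores = [0.3, -1.2, 0.8, 0.1]  # hypothetical content scores for query i = 2
m = -0.25                       # hypothetical ALiBi slope
a = alibi_row(scores, 2, m)
b = decay_row(scores, 2, math.exp(m))
```

The two rows agree up to floating-point error, so a linear bias inside the softmax acts as an exponential decay on the unnormalized attention weights.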

Relation to other sequence modeling models

In this section, we show the relationship between our model and other sequence modeling models such as Transformers (Vaswani et al., 2017), CNNs (LeCun et al., 1989), and the State space (Gu et al., 2022). We also compare the theoretical space-time complexity of our model with previous sequence modeling models in Table 1.

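Transformers. A Transformer with relative position embedding can be expressed as:

$$\mathbf{O}=\mathrm{Softmax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}+\mathbf{T})\mathbf{V},$$

where the bias matrix $\mathbf{T}$ has entries $w_{i-j}$ and is therefore a Toeplitz matrix. In this view, the TNN retains the position-dependent Toeplitz term while discarding the content-dependent similarity $\mathbf{Q}\mathbf{K}^{\top}$.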

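CNNs. A convolutional layer can be viewed as a Toeplitz matrix of a special structure. Consider a 1D convolution with kernel $\mathbf{h}\in\mathbb{R}^{m}$, input $\mathbf{x}\in\mathbb{R}^{n}$, and output $\mathbf{y}\in\mathbb{R}^{n+m-1}$:

$$\mathbf{y}_{i}=\sum_{j=0}^{i}\mathbf{h}_{i-j}\mathbf{x}_{j},\quad\mathbf{h}_{k}=0\ \text{for}\ k\geq m,\quad\mathbf{x}_{j}=0\ \text{for}\ j\geq n.$$

This equals $\mathbf{y}=\mathbf{T}\mathbf{x}$ for the banded lower-triangular Toeplitz matrix with $t_{i-j}=\mathbf{h}_{i-j}$ when $0\leq i-j\leq m-1$ and $t_{i-j}=0$ otherwise. Therefore, a 1D CNN can be viewed as a special case of the TNN with a zero-padded input. For better illustration, we provide a matrix form of the CNN operation in Appendix C.1.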
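The correspondence can be sketched in a few lines, assuming a full (non-circular) 1D convolution and a zero-padded input of length $n+m-1$. The function names and toy values are hypothetical.

```python
def conv1d_full(h, x):
    """Full 1D convolution: y_i = sum_j h[i - j] * x[j], length n + m - 1."""
    m, n = len(h), len(x)
    y = [0.0] * (n + m - 1)
    for j in range(n):
        for k in range(m):
            y[j + k] += h[k] * x[j]
    return y

def conv_as_toeplitz(h, size):
    """Banded lower-triangular Toeplitz matrix with T[i][j] = h[i - j]."""
    m = len(h)
    return [[h[i - j] if 0 <= i - j < m else 0.0 for j in range(size)]
            for i in range(size)]

h = [1.0, -2.0, 0.5]             # kernel, m = 3
x = [3.0, 1.0, 4.0, 1.0]         # input, n = 4
size = len(x) + len(h) - 1       # n + m - 1 = 6
x_pad = x + [0.0] * (size - len(x))

T = conv_as_toeplitz(h, size)
y_toeplitz = [sum(T[i][j] * x_pad[j] for j in range(size)) for i in range(size)]
y_conv = conv1d_full(h, x)
# Both equal [3.0, -5.0, 3.5, -6.5, 0.0, 0.5]
```

Only $m$ of the diagonals are non-zero here, which is the sense in which a convolution is a Toeplitz matrix with a restricted, local structure.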

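State space. The equation of the State space can be expressed as:

$$\mathbf{u}_{i}=\mathbf{A}\mathbf{u}_{i-1}+\mathbf{B}\mathbf{x}_{i},\quad\mathbf{y}_{i}=\mathbf{C}\mathbf{u}_{i},$$

where $\mathbf{x}_{i}$ is the input, $\mathbf{y}_{i}$ is the output, and $\mathbf{u}_{i}$ is the intermediate state, with $\mathbf{u}_{-1}=\mathbf{0}$. According to Gu et al. (2022), the output of the State space is:

$$\mathbf{y}_{i}=\sum_{j=0}^{i}\mathbf{k}_{i-j}\mathbf{x}_{j},\quad\mathbf{k}_{i}=\mathbf{C}\mathbf{A}^{i}\mathbf{B}.$$

In this case, the State space can be regarded as a special form of the TNN whose coefficients are calculated by the State space, i.e., $t_{i-j}=\mathbf{k}_{i-j}$ for $i\geq j$ and $t_{i-j}=0$ otherwise. We also provide the matrix form in Appendix C.2 for better illustration.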
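As a sanity check of this view, the sketch below assumes the standard discrete recurrence $\mathbf{u}_{i}=\mathbf{A}\mathbf{u}_{i-1}+\mathbf{B}x_{i}$, $y_{i}=\mathbf{C}\mathbf{u}_{i}$ with a zero initial state and a scalar input per step, and compares it with a lower-triangular Toeplitz product built from the kernel $k_{i}=\mathbf{C}\mathbf{A}^{i}\mathbf{B}$. The matrices, the input, and the helper names are hypothetical.

```python
def matvec(A, v):
    """Dense matrix-vector product for small lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def ssm_recurrent(A, B, C, x):
    """Run u_i = A u_{i-1} + B x_i, y_i = C u_i, starting from u = 0."""
    u = [0.0] * len(B)
    y = []
    for xi in x:
        Au = matvec(A, u)
        u = [au + b * xi for au, b in zip(Au, B)]
        y.append(sum(c * ui for c, ui in zip(C, u)))
    return y

def ssm_kernel(A, B, C, n):
    """Return k_i = C A^i B for i = 0..n-1."""
    k, v = [], list(B)
    for _ in range(n):
        k.append(sum(c * vi for c, vi in zip(C, v)))
        v = matvec(A, v)  # advance A^i B to A^(i+1) B
    return k

def ssm_toeplitz(A, B, C, x):
    """y = T x with lower-triangular Toeplitz T[i][j] = k_{i-j} for i >= j."""
    n = len(x)
    k = ssm_kernel(A, B, C, n)
    return [sum(k[i - j] * x[j] for j in range(i + 1)) for i in range(n)]

A = [[0.9, 0.1], [0.0, 0.5]]     # hypothetical 2-dimensional state
B = [1.0, 0.5]
C = [0.3, -0.7]
x = [1.0, 2.0, -1.0, 0.5, 3.0]

y_rec = ssm_recurrent(A, B, C, x)
y_toep = ssm_toeplitz(A, B, C, x)
```

The two outputs coincide up to floating-point error, so unrolling the recurrence yields a causal Toeplitz matrix whose coefficients are fixed by $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$.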

Experiment

We compare our method to four kinds of sequential modeling methods including attention-based methods, MLP-based methods, FFT-based methods, and State-space-based methods. In particular, we select the following methods:

Attention-based: Vanilla Transformer (Vaswani et al., 2017), Transformer-LS (Zhu et al., 2021), FLASH (Hua et al., 2022), 1+elu (Katharopoulos et al., 2020), Performer (Choromanski et al., 2020), cosFormer (Qin et al., 2022).

MLP-based: gMLP (Liu et al., 2021), Synthesizer (Random), Synthesizer (Dense) (Tay et al., 2021).

FFT-based: FNet (Lee-Thorp et al., 2022), GFNet (Rao et al., 2021), AFNO (Guibas et al., 2021).

State-space-based: S4 (Gu et al., 2022), DSS (Gupta et al., 2022), GSS (Mehta et al., 2022).

We evaluate our methods on the WikiText-103 (Merity et al., 2017) for autoregressive language modeling and the input length extrapolation ability, and the GLUE benchmark (Wang et al., 2018) for bidirectional language modeling. We also validate the accuracy and efficiency of our methods in handling long-range dependencies on the Long-Range Arena benchmark (Tay et al., 2020). To demonstrate the robustness of our model, we implement our model in DeiT (Touvron et al., 2021) structure and compare its performance with the vanilla DeiT (Touvron et al., 2021) on the ImageNet-1K (Deng et al., 2009) for image classification.

We implement our models in PyTorch (Paszke et al., 2019) and train them on 8 V100 GPUs. We adopt the same training configuration for all competitors, including batch size, learning rate, training epochs/updates, etc. More detailed hyper-parameters are listed in Appendix D.

For autoregressive language modeling, all models are trained on the WikiText-103 dataset (Merity et al., 2017) for 50K steps with a learning rate of $0.005$. We use perplexity (PPL) as the evaluation metric.

For bidirectional language modeling, we choose the RoBERTa (Liu et al., 2019) model as the base model structure for all methods. All models are pre-trained on WikiText-103 (Merity et al., 2017) for 50K steps with a learning rate of 0.005 and fine-tuned on the GLUE dataset (Wang et al., 2018). We try learning rates of 1e-5, 3e-5, 6e-5, and 1e-4 and choose the best result after fine-tuning for 3 epochs.

For the Long-Range Arena benchmark, we adopt the same experimental configurations as Skyformer (Chen et al., 2021). We ensure that the performance and efficiency of all methods are obtained with a similar parameter size and the same training hyperparameters.

For image classification on the ImageNet-1k dataset, we adopt the DeiT (Touvron et al., 2021) network structure and replace the transformer layers with our model.

Results

Autoregressive language modeling. Autoregressive language modeling is a crucial task that requires models to estimate the causal probability distribution given the previously seen tokens. In Table 2, we compare the proposed TNN with competing sequence modeling models. First, compared to existing MLP-based methods, TNN performs better by a clear margin on both the val set and the test set. Transformer-based methods are currently the dominant sequence modeling methods. Although the vanilla Transformer, a strong baseline, adopts a standard self-attention module with quadratic complexity, TNN still outperforms it on both the val and test sets. In addition, TNN achieves better results than most efficient transformers, including FLASH, 1+elu, Performer, and cosFormer. Finally, compared with the recently emerging State-space-based sequence modeling methods, TNN achieves superior performance to all competing methods. This demonstrates the effectiveness of our method in causal models.

We also compare the extrapolation capabilities of each method. Figure 1 shows that our method outperforms all other methods and is comparable to ALiBi (Press et al., 2022). Complete results can be found in Appendix 15.

Bidirectional language modeling. We benchmark bidirectional modeling methods on the GLUE datasets in Table 3. TNN achieves competitive results across all tasks. It is also worth noting that TNN improves the results on CoLA by a significant margin, showing its ability to reason about linguistic information in sequences. This demonstrates the effectiveness of TNN in bidirectional language modeling.

Long-Range Arena benchmark. As shown in Table 4, we compare TNN with competing methods across five tasks of the LRA benchmark. The results for the methods listed before Transformer-LS are taken from Skyformer (Chen et al., 2021). As demonstrated, TNN achieves the best scores on three tasks and second place on the remaining two tasks. In terms of overall results, TNN outperforms all other competing methods, including S4 (Gu et al., 2022). We re-run the S4 experiments with the new configuration to match the number of parameters. For the sake of completeness, we also compare TNN with S4 at the original size of S4, using the suffix "-Large", in Table 14, which validates our ability to encode long sequences.

For speed comparison, we compare the training speed of the TNN with other methods in Table 4.3. For a fair and comprehensive comparison, we follow exactly the same configurations as Skyformer (Chen et al., 2021) and report steps per second under different sequence lengths. Timing is conducted on an Nvidia A6000 GPU with 48G GPU memory.

Image modeling. We report classification results on the ImageNet-1k dataset in Table 4.3. As shown, under similar parameter sizes, TNN achieves better results than DeiT-Tiny and comparable results to DeiT-Small. This demonstrates the capability of our method in encoding visual signals.

Ablation study

Network structure configuration. We ablate different structure configurations on the autoregressive language modeling task in Table 4.3. We consider three configurations: GTU+GLU, GTU only, and attention+GLU. We empirically find that the GTU+GLU configuration achieves better performance than the other options and choose it as our structure in TNN.

Input of relative position encoder. In Table 4.3, we ablate different RPE inputs on language modeling. (-(n-1),…,(n-1)) denotes that we feed $2n-1$ constants into the RPE. (-(n-1),…,(n-1))/n denotes normalized constants. The sin, cos entry denotes the absolute position embedding method used in (Vaswani et al., 2017). We empirically find that using the original integers as the input for the RPE leads to better performance.

Relative position encoder. There are two ways to generate relative position coefficients for the Toeplitz matrix. One is to set these coefficients as learnable parameters and allow TNN to learn them from data. The other is to use our proposed RPE network to generate these coefficients. We compare these two strategies in Table 4.3. The TNN with our RPE network achieves an improvement of 2.47 PPL in language modeling.

Exponential decay rate. We ablate different exponential decay rates in Table 10 on language modeling. We train these model variants with a fixed sequence length of 512, test them on a series of sequence lengths from 512 to 14336, and compute the average PPL. When there is no exponential decay, the model fails to extrapolate to longer sequence lengths. We also test our model with a learnable decay rate, but it does not show better performance. We empirically select 0.99 as the exponential decay rate in our method.

Conclusion

In this paper, we propose the Toeplitz neural network, a new efficient architecture that relies entirely on relative positional information for sequence modeling. The proposed model enjoys a favorable log-linear space-time complexity. Thanks to the proposed relative position encoder and exponential decay techniques, the Toeplitz neural network generalizes to long sequences with a fixed budget of parameters while obtaining consistently superior performance to competing methods across multiple challenging tasks, including language modeling, image modeling, and sequence modeling on long inputs, i.e., the Long-Range Arena benchmark. The Toeplitz neural network is also a generic sequence modeling approach, which renders various popular architectures, such as Transformers, CNNs, and State-space-based methods, as its special forms, offering a unified view of sequence modeling.

References

Appendix A Mathematical notations

Appendix B Proof of theorem

In this section, we will prove Theorem 2.1. Before doing that, let’s first introduce the circulant matrix and Toeplitz matrix:

Based on the definition, we can give a key lemma:

The proof can be found in (Gray et al., 2006). Based on this, we can prove a key lemma:

Because $\mathbf{F}$ and $\mathbf{F}^{\top}$ are DFT matrices, $\mathbf{F}\mathbf{x}$ and $\mathbf{F}^{\top}\mathbf{x}$ can be computed in $O(n\log n)$ time (Bracewell & Bracewell, 1986). Since $\mathbf{\Lambda}$ is a diagonal matrix, $\mathbf{\Lambda}\mathbf{x}$ can be computed in $O(n)$ time. Note that its diagonal elements $\mathbf{F}[a_{0},a_{1},\ldots,a_{n-1}]^{\top}$ can also be computed in $O(n\log n)$ time. Therefore,

Using the notation of block matrix, we can define:

Computing $\mathbf{C}\mathbf{x}_{1}$ has a time complexity of $O(2n\log(2n))=O(n\log n)$.

Multiplying by $\left[\begin{array}{cc}\mathbf{I}_{n}&\mathbf{0}_{n\times n}\end{array}\right]$ is equivalent to selecting the first $n$ rows of $\mathbf{C}\mathbf{x}_{1}$, which takes $O(n)$ time.

So the total time complexity is $O(n\log n)$. ∎
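A small worked instance may help to see why the embedding works. For $n=2$, one possible $2n\times 2n$ circulant matrix $\mathbf{C}$ containing $\mathbf{T}$ as its top-left block is shown below; the filler entry $0$ is an arbitrary choice made for illustration, and the zero-padded vector $\mathbf{x}_{1}$ is assumed to stack $\mathbf{x}$ on top of $n$ zeros.

```latex
\[
\mathbf{T}=\begin{bmatrix} t_{0} & t_{-1}\\ t_{1} & t_{0}\end{bmatrix},\qquad
\mathbf{C}=\begin{bmatrix}
t_{0} & t_{-1} & 0 & t_{1}\\
t_{1} & t_{0} & t_{-1} & 0\\
0 & t_{1} & t_{0} & t_{-1}\\
t_{-1} & 0 & t_{1} & t_{0}
\end{bmatrix},\qquad
\mathbf{x}_{1}=\begin{bmatrix} x_{0}\\ x_{1}\\ 0\\ 0\end{bmatrix}
\]
\[
\mathbf{C}\mathbf{x}_{1}=\begin{bmatrix}
t_{0}x_{0}+t_{-1}x_{1}\\
t_{1}x_{0}+t_{0}x_{1}\\
t_{1}x_{1}\\
t_{-1}x_{0}
\end{bmatrix},\qquad
\begin{bmatrix}\mathbf{I}_{2} & \mathbf{0}_{2\times 2}\end{bmatrix}\mathbf{C}\mathbf{x}_{1}
=\begin{bmatrix}
t_{0}x_{0}+t_{-1}x_{1}\\
t_{1}x_{0}+t_{0}x_{1}
\end{bmatrix}
=\mathbf{T}\mathbf{x}.
\]
```

Each row of $\mathbf{C}$ is the previous row shifted cyclically by one position, so $\mathbf{C}$ is circulant, and discarding the last $n$ entries of $\mathbf{C}\mathbf{x}_{1}$ recovers $\mathbf{T}\mathbf{x}$ exactly.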

Appendix C Matrix form of sequential models

In this section, we give the matrix form of some sequence models mentioned in section 3.4.

C.1 CNN

The matrix form of the CNN mentioned in Eq. 10 is:

C.2 State Space

The Toeplitz matrix mentioned in Eq. 15 is:

Appendix D Configurations

Appendix E Experiments

Appendix F Extrapolation

Appendix G Visualization

In this section, we visualize TNN. In particular, we choose the Toeplitz matrix used in RoBERTa for visualization.