Linearized Relative Positional Encoding

Zhen Qin, Weixuan Sun, Kaiyue Lu, Hui Deng, Dongxu Li, Xiaodong Han, Yuchao Dai, Lingpeng Kong, Yiran Zhong

Introduction

Transformers have achieved remarkable progress in natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Liu et al., 2021; Arnab et al., 2021) and audio processing (Karita et al., 2019; Zhang et al., 2020; Gulati et al., 2020; Sun et al., 2022). As an important ingredient in transformers, positional encoding assigns a unique representation for each position of a token in a sequence so that the transformers can break the permutation invariance property. Among these encoding methods, absolute positional encoding (Vaswani et al., 2017; Sukhbaatar et al., 2015; Devlin et al., 2019; Liu et al., 2020) maps each individual position index into a continuous encoding. Whereas relative positional encoding (Shaw et al., 2018; Su et al., 2021; Horn et al., 2021; Liutkus et al., 2021; Huang et al., 2020; Raffel et al., 2019) generates encoding for each query-key pair, representing their relative positional offset. We focus on relative positional encoding as they are not constrained by input lengths (Chen, 2021) while showing superior performance (Shaw et al., 2018).

Linear transformers (Chen, 2021; Qin et al., 2022b; a; Liu et al., 2022; Lu et al., 2022) attract more attention recently as they can achieve linear space-time complexity with respect to input sequence length while maintaining comparable performance with vanilla transformers. Most existing linear transformers use absolute positional encoding methods to encode positional information, since most existing relative positional encoding methods are designed for vanilla transformers and are not directly applicable to linear transformers. The main cause behind this limitation is that linear transformers decompose key and value representations in the self-attention modules into separate kernel functions to achieve linear space-time complexity. Such an additional requirement on the decomposibility is not always satisfied by existing relative positional encoding methods. On the other hand, despite some individual works (Qin et al., 2022b; Chen, 2021), general principles for designing relative positional encoding for linear transformers remain largely understudied. A recent work, RoPE Su et al. (2021) proposes a new set of multiplicative encoding solutions based on rotational positional encoding and can be applied to linear transformers. In Appendix D.1, we show that RoPE can be seen as a special form of LRPE.

In this work, we aim to bridge this gap and study the principal framework to develop relative positional encoding applicable to linear transformers. To this end, we start by presenting a canonical form of relative positional encoding, which reveals that differences in existing encoding methods boil down to choices of a set of query, key, and relative positional matrix primitives. By properly selecting and composing these primitives, we could derive various existing encoding methods for transformers.

Taking advantage of the canonical form, we introduce the main contribution of our work, i.e., a special family of relative positional encoding methods called linearized relative positional encoding (LRPE). Specifically, we supply a sufficient condition for designing compatible encoding methods, especially for linear transformers, and prove that the linearized relative positional encoding is a unitary transformation. The benefits of using unitary transformation are twofold. On one side, since it is derived from the decomposable positional matrix, it can maintain the linear space-time complexity as shown in Fig. 1. Second, the unitary transformation property allows us to effectively derive the family of closed-form solutions. In particular, we show that a number of encoding methods pertain to the LRPE family, including those used in RoPE (Su et al., 2021) and PermuteFormer (Chen, 2021).

Furthermore, LRPE sheds light on a simple yet flexible theoretical paradigm to develop new effective relative positional encoding. To demonstrate this, we derive non-exhaustively three additional LRPE encoding methods by parameterizing the generic solution differently, including solutions living in either real or complex domains. Since unitary transformations are special cases of a relative positional matrix, LRPE is applicable to linear transformers and exclusively suitable within encoder and/or decoder layers. We experimentally demonstrate the effectiveness of the LRPE family on autoregressive and bidirectional language modeling, text classification, and image classification. Results show that LRPE achieves superior capability in representing relative positional information, commonly resulting in unrivaled performance than previous encoding methods.

In summary, our main contributions are as follows:

We present a canonical form of relative positional encoding, which derives most existing relative positional encoding methods as its special case, including those used in linear transformers.

Based on the canonical form, we propose linearized relative position encoding (LRPE), a simple yet principal formulation to derive an encoding family that respects the linear space-time complexity in linear transformers. We show several existing relative positional encoding methods in linear transformers are in LRPE family. We also provide additional solutions from this generic form.

Experiments on various downstream tasks, such as language modeling, text classification, and image classification show that the LRPE family is more robust and consistently produces better results across tasks than previous relative encoding methods, are flexible in being plugged into encoder and/or decoder layers in linear models. In addition, it is generic to derive existing and potentially new encoding methods.

Background and Preliminary

2 Positional encoding

Self-attention is capable of parallel sequence processing but cannot capture positional information of each token. To address this issue, positional encoding methods are proposed, which can be generally categorized into two groups: absolute positional encoding and relative positional encoding.

where the encoding formulation only depends on the absolute position index ss, and the positional encoding size is restricted by the input sequence length.

Relative positional encoding considers relative position offsets between two input tokens (Shaw et al., 2018; Qin et al., 2023), i.e.,

where s,ts,t are the two positional indeices, est\mathbf{e}_{st} denotes the attention score before softmax. Compared to absolute positional encoding, relative positional encoding generally achieves better performance as it can handle variable input length (Chen, 2021). However, extra cost on computation and memory makes it not so efficient than absolute positional encoding (Likhomanenko et al., 2021).

Most existing relative positional encoding methods (Raffel et al., 2019; Shaw et al., 2018; Huang et al., 2020; Chi et al., 2022) require computing query-key attention QKT\mathbf{Q}\mathbf{K}^{\mkern-1.5mu\mathsf{T}} and combine with relative positional information, which incurs quadratic complexity. In contrast, linear attention avoids such a query-key product to achieve the linear complexity. Therefore, common relative positional encoding methods are usually not applicable in linear transformers.

Our Method

In this section, we present our main technical contribution on linearized relative positional encoding, which is an encoding family that preserves linear space-time complexity. Specifically, we start by presenting a canonical form of relative positional encoding and show that existing encoding methods can be derived by instantiating the canonical form with different choices of so-called primitive queries, keys, and positional matrices in Section 3.1. When imposing the decomposability constraint on this canonical form, we obtain a sufficient condition for linearized relative positional encoding (LRPE) and derive a family of concrete solutions in real and complex domains in Section 3.2. We provide an implementation sketch in Section 3.3.

To demonstrate Eq. 6 is a generic formulation, we show that it flexibly induces a wide range of existing relative encoding methods (Shaw et al., 2018; Su et al., 2021; Horn et al., 2021; Liutkus et al., 2021; Huang et al., 2020; Raffel et al., 2019) by selecting and compositing different choices of primitives. Among them, we highlight four examples in the following section and leave the complete discussions in the Appendix C.1.

Additive. In (Huang et al., 2020), the relative positional encoding is formulated as an extra additive term to the query-key inner-product:

which can be derived by including an extra identity term as a primitive, formally denoted as:

Multiplicative. In RoPE (Su et al., 2021), the relative positional encoding works in the form of the weighted inner product:

1.2 Simplification

For the ease of the remaining discussion, we introduce the necessary notations and simplify Eq. 6.

2 Linearized relative position encoding

Eq. 6 is a canonical form of relative positional encoding, meaning that its variants are applicable to vanilla transformers but not necessarily for linear ones. To design relative encoding compatible with linear transformers, the attention computation has to respect the decomposibilty condition. This additional condition leads to the linearized relative position encoding (LRPE) family, defined as follows.

A relative position encoding is called linearized relative position encoding (LRPE), when the following holds:

The assumption of W0=Id\mathbf{W}_{0}=\mathbf{I}_{d} implies that the interaction between tokens from the same position only depends on the content, which is reasonable enough that most encoding methods respect. In its essence, Eq. 13 ensures the positional matrix is decomposable. In this way, the query-key inner-product can be avoided in the attention computation. Consequently, complexity of computing LRPE is O(nd2)O(nd^{2}), where nn is sequence length, dd is embedding dimension as Appendix C.2 shows in detail.

We prove that Eq. 13 can be simplified based on the following proposition:

Eq. 13 is equivalent to Eq. 14 and Wt\mathbf{W}_{t} is Unitary matrix,

According to the arbitrariness of qs,kt\mathbf{q}_{s},\mathbf{k}_{t}, Eq. 13 is equivalent to

Take s=ts=t in Eq 13, we get (since we assume that W0=Id\mathbf{W}_{0}=\mathbf{I}_{d}):

Thus, Ms\mathbf{M}_{s} is a unitary matrix. On the other hand, note that for any unitary matrix P\mathbf{P}, we always have

This means that left multiplying Mt\mathbf{M}_{t} by a unitary matrix P\mathbf{P} does not change Eq. 13. Since Ms\mathbf{M}_{s} and M0H\mathbf{M}_{0}^{\mathsf{H}} are also unitary matrices, we can perform the following transformation:

With Ms\overline{\mathbf{M}}_{s}, Eq. 15 becomes

Since Ms\overline{\mathbf{M}}_{s} is a unitary matrix, Ws\mathbf{W}_{s} is also a unitary matrix, i.e.,

In the following section, we derive some particular solutions of Eq. 14.

In this section, we discuss Eq. 14 and give a family of solutions. It is worth noting that the solutions we provide are all in the form of Ws=PHΛ(s)P\mathbf{W}_{s}=\mathbf{P}^{\mathsf{H}}\mathbf{\Lambda}^{(s)}\mathbf{P}, where P,Λ(s)\mathbf{P},\mathbf{\Lambda}^{(s)} are unitary matrices. The complete derivation can be found in Appendix C.4, C.5, C.6.

Unitary (Solution 1) The first case is discussed in the complex domain, which is not common in transformer models yet exhibiting an elegant solution.

Orthogonal (Solution 2) Now we consider the real domain, a more general case in transformers.

Permutation (Solution 3) The last case is inspired by PermuteFormer (Chen, 2021), which is associated with the permutation matrix:

3 The LRPE family

LRPE (Ws=PHΛ(s)P\mathbf{W}_{s}=\mathbf{P}^{\mathsf{H}}\mathbf{\Lambda}^{(s)}\mathbf{P}) contains two components, i.e., a fixed unitary matrix P\mathbf{P} and a unitary matrix family Λ(s)\mathbf{\Lambda}^{(s)} as mentioned in proposition 3.3, 3.4, and 3.5. The P\mathbf{P} can be seen as a rotation matrix that rotates the token feature to a particular coordinate system and the Λ(s)\mathbf{\Lambda}^{(s)} derives the positional information from the rotated feature.

To meet all the requirements in proposition 3.3, 3.4, and 3.5, P\mathbf{P} needs to be an orthogonal matrix. We empirically find that when P\mathbf{P} is a householder matrix (Golub & Van Loan, 2013), the overall performance is better than other options such as permutation matrix and Identity matrix. We provide a detailed ablation in Table 5. For ease of expression, we use Type 1 for the unitary solution, Type 2 for the orthogonal solution, and Type 3 for the permutation solution. Details can be found in Appendix D.1.

Experiments

In this section, we validate the effectiveness of the proposed LRPE on natural language processing tasks and computer vision tasks that resort to different Transformer architectures. Specifically, we first study the autoregressive language model (Radford et al., 2018). This is followed by the bidirectional language model, which adopts the Roberta architecture (Liu et al., 2020) and is pretrained and then fine-tuned on several downstream tasks from the GLUE benchmark (Wang et al., 2018). We also extend our evaluation on image classification task to verify the generalization ability of LRPE.

We use Wikitext-103 Merity et al. (2016), Books Zhu et al. (2015), and WikiBook Wettig et al. (2022) datasets for NLP task evaluation and ImageNet-1k Deng et al. (2009) for image classification evaluation. Wikitext-103 is a small dataset containing a preprocessed version of the Wikipedia dataset. Books consists of a large number of novels, making it suitable for long sequence modeling evaluation. WikiBook is a large corpus (22 GB) of Wikipedia articles and books collected by Wettig et al. (2022). ImageNet-1k is the most popular large image classification dataset. It contains 1000 object classes and over 1 million training images and is often used to verify the performance of models in image modeling.

Our experiments are implemented in the Fairseq framework (Ott et al., 2019) and trained with V100 GPUs. All the methods share the same configurations such as learning rate, batch size, and optimizer. The detailed configurations are listed in Appendix E.

2 Results

Autoregressive language model. The autoregressive language model has 6 decoder layers and is trained on the WikiText-103 dataset (Merity et al., 2017). In order to test the performance of the method on long sequence modeling, we tested the performance of the model on the Books (Zhu et al., 2015) dataset. We use the Perplexity (PPL) as the evaluation metric and report the results in Table 2. We observe that all variants of LRPE present a performance gain over the baseline. Notably, Type 1 and Type 2 models achieve the best performance on Wikitext-103 and Books, respectively, demonstrating superior capability in language modeling.

Bidirectional language model. The bidirectional model follows an encoder-only structure, i.e., Roberta (Liu et al., 2020), with 12 layers. In order to verify the performance of the model on a large data set, we adopt the Wikibook dataset used by Wettig et al. (2022) for pre-training and used their configurations to update 23k times. The results are in Table 1 and Figure 2. In the pre-training phase, LRPE outperforms all competitors. Next, we fine-tune the model for the GLUE task. As shown in Table 1, our method outperforms competing methods on all tasks with a clear margin.

Image classification model. To verify the robustness and effectiveness of LRPE under different modal tasks, we test our method on the computer vision domain. Specifically, we conduct experiments on Imagenet-1k Deng et al. (2009) dataset using the Deit-small architecture Touvron et al. (2021) on the image classification task. In particular, we replace the Attention with Linear Attention Katharopoulos et al. (2020) and then adopt various relative positional encoding. As shown in Table 3, LRPE beats all the competing methods.

Long-Range Arena. In order to validate the effectiveness of LRPE on long-sequence modeling tasks, we conducted experiments on Long-Range Arena benchmark (Tay et al., 2020). As shown in Table 4, LRPE has positive effects on almost all tasks.

3 Discussion

According to the discussion in Section. 3.3, The proposed LRPE rotates the token feature through P\mathbf{P}, and encodes the positional information through Λ(s)\mathbf{\Lambda}^{(s)} . In Table 5, we ablate the effectiveness of the P\mathbf{P} matrix on the autoregressive language modeling task. Our approach with the Householder matrix achieves better results than the one equipped with other metrics. It indicates that we can get better performance by carefully selecting the projection of the positional encoding.

The implementation of the proposed LRPE does not affect the computational complexity of the linear transformer, i.e., preserving the linear complexity as O(n)O(n). We also measure the training speed of the bidirectional language model on the same local machine and observe that the speed after using LRPE is only a bit slower than the baseline on average. The detailed comparison of the efficiency can be found in Table 6. In general, LRPE does not incur a significant computational burden to the transformer, and can fulfill the practical needs by maintaining comparable efficiency.

Conclusion

In this paper, we standardize the form of relative positional encoding for linear attention. The unitary transformation is employed as a special solution to the linearized relative positional encoding, and the solutions as per various constraints constitute the unitary relative positional encoding (LRPE) family. We validate the effectiveness of LRPE through extensive experiments on both natural language processing and computer vision tasks with different transformer architectures. It outperforms competing methods in all tasks. In addition, it highlights a broad paradigm for formulating linear transformer-applicable positional encoding techniques that are more generically relative.

References

Appendix A Mathematical Notations

Appendix B Computation of Vanilla/Linear Attention

B.2 Vanilla Attention

In vanilla attention, the output is computed using the Softmax weighted sum, i.e.,

B.3 Linear Attention

The linear attention is formulated as follows,

Appendix C Proof of Theorem

In the following, we provide two additional examples of relative positional encoding with the canonical form.

which indicates that the relative positional encoding is effectively a coefficient term in the attention matrix, as such, it can be derived via a positional matrix primitive with the coefficients.

C.2 Speed analysis

For this, we only need to prove that the time complexity is linear with respect to nn. To this end, we first give basic notations as follows,

Clearly, Eq. 38 is a standard formulation for the linear attention with the time complexity as O(nd2)O(nd^{2}). Combing it with the first step, we have the total time complexity as O(nd2)O(nd^{2}), which is unchanged. ∎

C.3 Linearized Relative Positional Encoding

Before the proof, we first give the following theorems (Yao & Algebra, 2015):

C.4 Unitary (Solution 1)

In this case, k=1,,d\forall k=1,\ldots,d, we have

Note that 2lπ2l\pi does not affect the result, so we can assume l=0l=0, i.e.,

C.5 Orthogonal (Solution 2)

For A(s)\mathbf{A}^{(s)}, considering the kk-th component, we get

Note that 2tπ2t\pi does not affect the result, so we can assume t=0t=0, i.e.,

Next, for B(s)\mathbf{B}^{(s)}, the conclusion is more obvious, i.e.,

C.6 Permutation

Prior to the proof, we first provide some relevant definitions and propositions.

Permutation π\pi is a bijection defined on the integer set:

For Λk\mathbf{\Lambda}_{k}, we have the following important properties:

For Λk\mathbf{\Lambda}_{k} defined in C.5, we have:

For k=1k=1, the conclusion is obvious. Now assuming that the conclusion holds for k=s1k=s-1, when k=sk=s, we have

We first prove that the conclusion holds for k=1k=1:

Since Λ1\mathbf{\Lambda}_{1} is a square matrix, we also have

Based on the above conclusions, we can prove Proposition 3.5 below.

The next step is to verify that it satisfies Eq. 14, which follows Theorem C.7 and C.8:

Appendix D Implementation

LRPE(Ws=PHΛ(s)P\mathbf{W}_{s}=\mathbf{P}^{\mathsf{H}}\mathbf{\Lambda}^{(s)}\mathbf{P}) contains two components, i.e., the fixed unitary matrix P\mathbf{P} and the unitary matrix family Λ(s)\mathbf{\Lambda}^{(s)} mentioned in proposition 3.3, 3.4, and 3.5. We first introduce the choice of matrices P/Λ(s)\mathbf{P}/{\mathbf{\Lambda}}^{(s)}, and then illustrate some implementation tricks.

For matrix P\mathbf{P}, We list the species mentioned in the paper below:

In our implementation, we sample v\mathbf{v} from a standard normal distribution, and make it deterministic.

Permutation matrix: formulated as per the following permutation (inspired by Flash (Hua et al., 2022)), i.e.,

For matrix family Λ(s)\mathbf{\Lambda}^{(s)}, we use the following settings:

For unitary (Solution 1) (3.3), we use the same method in (Su et al., 2021) with initialized αt=100002t/d\alpha_{t}=10000^{-2t/d}, and make it learnable.

For orthogonal (Solution 2) (3.4), we choose the dimension of identity submatrix q=0q=0 with initialized αt=100002t/d\alpha_{t}=10000^{-2t/d} as in (Su et al., 2021) and make it learnable.

Another notable version to choose the dimension of the identity submatrix q=0q=0 with initialized αt=100002t/d\alpha_{t}=10000^{-2t/d} as in (Su et al., 2021), and make it deterministic. When using this version along with the identity matrix, we can get RoPE (Su et al., 2021).

For permutation (Solution 3) (3.5), we randomly choose the permutation and make it deterministic.

Notice that when combing this method with identity matrix, we can get a version of PermuteFormer (Chen, 2021).

According to the following facts, we can simplify the computation, i.e.,

Hence, in practice, we can use Ws=PHΛ(s)\mathbf{W}_{s}=\mathbf{P}^{\mathsf{H}}\mathbf{\Lambda}^{(s)} instead of Ws=PHΛ(s)P\mathbf{W}_{s}=\mathbf{P}^{\mathsf{H}}\mathbf{\Lambda}^{(s)}\mathbf{P} to reduce the computational costs.

D.2 Pseudocode

In this section, we provide pseudocodes for LRPE in Python:

Appendix E Configuration