Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

cs.LG cs.CL stat.ML

Introduction and related work

Transformers (Vaswani et al., 2017; Dehghani et al., 2019) are powerful neural network architectures that have become SOTA in several areas of machine learning including natural language processing (NLP) (e.g. speech recognition (Luo et al., 2020)), neural machine translation (NMT) (Chen et al., 2018), document generation/summarization, time series prediction, generative modeling (e.g. image generation (Parmar et al., 2018)), music generation (Huang et al., 2019), and bioinformatics (Rives et al., 2019; Madani et al., 2020; Ingraham et al., 2019; Elnaggar et al., 2019; Du et al., 2020).

Transformers rely on a trainable attention mechanism that identifies complex dependencies between the elements of each input sequence. Unfortunately, the regular Transformer scales quadratically with the number of tokens $L$ in the input sequence, which is prohibitively expensive for large $L$ and precludes its usage in settings with limited computational resources even for moderate values of $L$ . Several solutions have been proposed to address this issue (Beltagy et al., 2020; Gulati et al., 2020; Chan et al., 2020; Child et al., 2019; Bello et al., 2019). Most approaches restrict the attention mechanism to attend to local neighborhoods (Parmar et al., 2018) or incorporate structural priors on attention such as sparsity (Child et al., 2019), pooling-based compression (Rae et al., 2020) clustering/binning/convolution techniques (e.g. (Roy et al., 2020) which applies $k$ -means clustering to learn dynamic sparse attention regions, or (Kitaev et al., 2020), where locality sensitive hashing is used to group together tokens of similar embeddings), sliding windows (Beltagy et al., 2020), or truncated targeting (Chelba et al., 2020). There is also a long line of research on using dense attention matrices, but defined by low-rank kernels substituting softmax (Katharopoulos et al., 2020; Shen et al., 2018). Those methods critically rely on kernels admitting explicit representations as dot-products of finite positive-feature vectors.

The approaches above do not aim to approximate regular attention, but rather propose simpler and more tractable attention mechanisms, often by incorporating additional constraints (e.g. identical query and key sets as in (Kitaev et al., 2020)), or by trading regular with sparse attention using more layers (Child et al., 2019). Unfortunately, there is a lack of rigorous guarantees for the representation power produced by such methods, and sometimes the validity of sparsity patterns can only be verified empirically through trial and error by constructing special GPU operations (e.g. either writing C++ CUDA kernels (Child et al., 2019) or using TVMs (Beltagy et al., 2020)). Other techniques which aim to reduce Transformers’ space complexity include reversible residual layers allowing one-time activation storage in training (Kitaev et al., 2020) and shared attention weights (Xiao et al., 2019). These constraints may impede application to long-sequence problems, where approximations of the attention mechanism are not sufficient. Approximations based on truncated back-propagation (Dai et al., 2019) are also unable to capture long-distance correlations since the gradients are only propagated inside a localized window. Other methods propose biased estimation of regular attention but only in the non-causal setting and with large mean squared error (Wang et al., 2020).

In response, we introduce the first Transformer architectures, Performers, capable of provably accurate and practical estimation of regular (softmax) full-rank attention, but of only linear space and time complexity and not relying on any priors such as sparsity or low-rankness. Performers use the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism, leveraging new methods for approximating softmax and Gaussian kernels, which we propose. We believe these methods are of independent interest, contributing to the theory of scalable kernel methods. Consequently, Performers are the first linear architectures fully compatible (via small amounts of fine-tuning) with regular Transformers, providing strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and lower variance of the approximation.

FAVOR+ can be also applied to efficiently model other kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, that are beyond the reach of regular Transformers, and find for them optimal attention-kernels. FAVOR+ can also be applied beyond the Transformer scope as a more scalable replacement for regular attention, which itself has a wide variety of uses in computer vision (Fu et al., 2019), reinforcement learning (Zambaldi et al., 2019), training with softmax cross entropy loss, and even combinatorial optimization (Vinyals et al., 2015).

We test Performers on a rich set of tasks ranging from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers. We emphasize that in principle, FAVOR+ can also be combined with other techniques, such as reversible layers (Kitaev et al., 2020) or cluster-based attention (Roy et al., 2020).

FAVOR+ Mechanism & Positive Orthogonal Random Features

Another important type of attention is unidirectional dot-product attention which has the form:

We will show that attention matrix $\mathbf{A}$ can be approximated up to any precision in time $O(Ld^{2}\log(d))$ . For comparison, popular methods leveraging sparsity via Locality-Sensitive Hashing (LSH) techniques (Kitaev et al., 2020) have $O(Ld^{2}\log L)$ time complexity. In the main body of the paper we will describe FAVOR+ for bidirectional attention. Completely analogous results can be obtained for the unidirectional variant via the mechanism of prefix-sums (all details in the Appendix B.1).

2 Generalized Kernelizable Attention

The above scheme constitutes the FA-part of the FAVOR+ mechanism. The remaining OR+ part answers the following questions: (1) How expressive is the attention model defined in Equation 3, and in particular, can we use it in principle to approximate regular softmax attention ? (2) How do we implement it robustly in practice, and in particular, can we choose $r\ll L$ for $L\gg d$ to obtain desired space and time complexity gains? We answer these questions in the next sections.

3 How to and how not to approximate softmax-kernels for Attention

We propose a robust mechanism in this paper. Furthermore, the variance of our new unbiased positive random feature map estimator tends to as approximated values tend to (see: Section 3).

4 Orthogonal Random Features (ORFs)

The above constitutes the R+ part of the FAVOR+ method. It remains to explain the O-part. To further reduce the variance of the estimator (so that we can use an even smaller number of random features $r$ ), we entangle different random samples $\omega_{1},...,\omega_{m}$ to be exactly orthogonal. This can be done while maintaining unbiasedness whenever isotropic distributions $\mathcal{D}$ are used (i.e. in particular in all kernels we considered so far) by the standard Gram-Schmidt orthogonalization procedure (see (Choromanski et al., 2017) for details). ORFs is a well-known method, yet it turns out that it works particularly well with our introduced PRFs for softmax. This leads to the first theoretical results showing that ORFs can be applied to reduce the variance of softmax/Gaussian kernel estimators for any dimensionality $d$ rather than just asymptotically for large enough $d$ (as is the case for previous methods, see: next section) and leads to the first exponentially small bounds on large deviations probabilities that are strictly smaller than for non-orthogonal methods. Positivity of random features plays a key role in these bounds. The ORF mechanism requires $m\leq d$ , but this will be the case in all our experiments. The pseudocode of the entire FAVOR+ algorithm is given in Appendix B.

Our theoretical results are tightly aligned with experiments. We show in Section 4 that PRFs+ORFs drastically improve accuracy of the approximation of the attention matrix and enable us to reduce $r$ which results in an accurate as well as space and time efficient mechanism which we call FAVOR+.

Theoretical results

We present here the theory of positive orthogonal random features for softmax-kernel estimation. All these results can be applied also to the Gaussian kernel, since as explained in the previous section, one can be obtained from the other by renormalization (see: Section 2.3). All proofs and additional more general theoretical results with a discussion are given in the Appendix.

Furthermore, the latter holds for $d\geq 2$ even if the $L_{\infty}$ -norm condition is not satisfied, i.e. the regularized softmax-kernel is a universal lower bound for the softmax-kernel.

For the regularized softmax-kernel, orthogonal features provide additional concentration results - the first exponentially small bounds for probabilities of estimators’ tails that are strictly better than for non-orthogonal variants for every $d>0$ . Our next result enables us to explicitly estimate the gap.

Experiments

We implemented our setup on top of pre-existing Transformer training code in Jax (Frostig et al., 2018) optimized with just-in-time (jax.jit) compilation, and complement our theory with empirical evidence to demonstrate the practicality of FAVOR+ in multiple settings. Unless explicitly stated, a Performer replaces only the attention component with our method, while all other components are exactly the same as for the regular Transformer. For shorthand notation, we denote unidirectional/causal modelling as (U) and bidirectional/masked language modelling as (B).

In terms of baselines, we use other Transformer models for comparison, although some of them are restricted to only one case - e.g. Reformer (Kitaev et al., 2020) is only (U), and Linformer (Wang et al., 2020) is only (B). Furthermore, we use PG-19 (Rae et al., 2020) as an alternative (B) pretraining benchmark, as it is made for long-length sequence training compared to the (now publicly unavailable) BookCorpus (Zhu et al., 2015) + Wikipedia dataset used in BERT (Devlin et al., 2018) and Linformer. All model and tokenization hyperparameters are shown in Appendix A.

We compared speed-wise the backward pass of the Transformer and the Performer in (B) setting, as it is one of the main computational bottlenecks during training, when using the regular default size $(n_{heads},n_{layers},d_{ff},d)=(8,6,2048,512)$ , where $d_{ff}$ denotes the width of the MLP layers. We observed (Fig. 3) that in terms of $L$ , the Performer reaches nearly linear time and sub-quadratic memory consumption (since the explicit $O(L^{2})$ attention matrix is not stored). In fact, the Performer achieves nearly optimal speedup and memory efficiency possible, depicted by the "X"-line when attention is replaced with the "identity function" simply returning the $\mathbf{V}$ -matrix. The combination of both memory and backward pass efficiencies for large $L$ allows respectively, large batch training and lower wall clock time per gradient step. Extensive additional results are demonstrated in Appendix E by varying layers, raw attention, and architecture sizes.

2 Softmax attention approximation error

We further examined the approximation error via FAVOR+ in Fig. 4. We demonstrate that 1. Orthogonal features produce lower error than unstructured (IID) features, 2. Positive features produce lower error than trigonometric $\sin$ / $\cos$ features. These two empirically validate the PORF mechanism.

To further improve overall approximation of attention blocks across multiple iterations which further improves training, random samples should be periodically redrawn (Fig. 5, right). This is a cheap procedure, but can be further optimized (Appendix B.2).

3 Softmax approximation on Transformers

Even if the approximation of the attention mechanism is tight, small errors can easily propagate throughout multiple Transformer layers (e.g. MLPs, multiple heads), as we show in Fig. 14 (Appendix). In other words, the model’s Lipschitz constant can easily scale up small attention approximation error, which means that very tight approximations may sometimes be needed. Thus, when applying FAVOR(+)’s softmax approximations on a Transformer model (i.e. "Performer-X-SOFTMAX"), we demonstrate that:

1. Backwards compatibility with pretrained models is available as a benefit from softmax approximation, via small finetuning (required due to error propagation) even for trigonometric features (Fig. 5, left) on the LM1B dataset (Chelba et al., 2014). However, when on larger dataset PG-19, 2. Positive (POS) softmax features (with redrawing) become crucial for achieving performance matching regular Transformers (Fig. 5, right).

4 Multiple layer training for proteins

5 Large length training - Common datasets

On the standard (U) ImageNet64 benchmark from (Parmar et al., 2018) with $L=12288$ which is unfeasible for regular Transformers, we set all models to use the same $(n_{heads},d_{ff},d)$ but varying $n_{layers}$ . Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches the Reformer/24-layers (Fig. 7: left). Depending on hardware (TPU or GPU), we also found that the Performer can be 2x faster than the Reformer via Jax optimizations for the (U) setting.

For a proof of principle study, we also create an initial protein benchmark for predicting interactions among groups of proteins by concatenating protein sequences to length $L=8192$ from TrEMBL, long enough to model protein interaction networks without the large sequence alignments required by existing methods (Cong et al., 2019). In this setting, a regular Transformer overloads memory even at a batch size of $1$ per chip, by a wide margin. Thus as a baseline, we were forced to use a significantly smaller variant, reducing to $(n_{heads},n_{layers},d_{ff},d)=(8,\{1,2,3\},256,256)$ . Meanwhile, the Performer trains efficiently at a batch size of 8 per chip using the standard $(8,6,2048,512)$ architecture. We see in Fig. 7 (right subfigure) that the smaller Transformer ( $n_{layer}=3$ ) is quickly bounded at $\approx 19\%$ , while the Performer is able to train continuously to $\approx 24\%$ .

Conclusion

Broader impact

We believe that the presented algorithm can be impactful in various ways:

Biology and Medicine: Our method has the potential to directly impact research on biological sequence analysis by enabling the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix. The initial application that we consider is the prediction of interactions between proteins on the proteome scale. Recently published approaches require large evolutionary sequence alignments, a bottleneck for applications to mammalian genomes (Cong et al., 2019). The potentially broad translational impact of applying these approaches to biological sequences was one of the main motivations of this work. We believe that modern bioinformatics can immensely benefit from new machine learning techniques with Transformers being among the most promising. Scaling up these methods to train faster more accurate language models opens the door to the ability to design sets of molecules with pre-specified interaction properties. These approaches could be used to augment existing physics-based design strategies that are of critical importance for example in the development of new nanoparticle vaccines (Marcandalli et al., 2019).

Research on Transformers: We believe that our results can shape research on efficient Transformers architectures, guiding the field towards methods with strong mathematical foundations. Our research may also hopefully extend Transformers also beyond their standard scope (e.g. by considering the Generalized Attention mechanism and connections with kernels). Exploring scalable Transformer architectures that can handle $L$ of the order of magnitude few thousands and more, preserving accuracy of the baseline at the same time, is a gateway to new breakthroughs in bio-informatics, e.g. language modeling for proteins, as we explained in the paper. Our presented method can be potentially a first step.

Backward Compatibility: Our Performer can be used on the top of a regular pre-trained Transformer as opposed to other Transformer variants. Even if up-training is not required, FAVOR+ can still be used for fast inference with no loss of accuracy. We think about this backward compatibility as a very important additional feature of the presented techniques that might be particularly attractive for practitioners.

Attention Beyond Transformers: Finally, FAVOR+ can be applied to approximate exact attention also outside the scope of Transformers. This opens a large volume of new potential applications including: hierarchical attention networks (HANS) (Yang et al., 2016), graph attention networks (Velickovic et al., 2018), image processing (Fu et al., 2019), and reinforcement learning/robotics (Tang et al., 2020).

Acknowledgements

We thank Nikita Kitaev and Wojciech Gajewski for multiple discussions on the Reformer, and also thank Aurko Roy and Ashish Vaswani for multiple discussions on the Routing Transformer. We further thank Joshua Meier, John Platt, and Tom Weingarten for many fruitful discussions on biological data and useful comments on this draft. We lastly thank Yi Tay and Mostafa Dehghani for discussions on comparing baselines.

Valerii Likhosherstov acknowledges support from the Cambridge Trust and DeepMind. Lucy Colwell acknowledges support from the Simons Foundation. Adrian Weller acknowledges support from a Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute under EPSRC grant EP/N510129/1 and U/B/000074, and the Leverhulme Trust via CFI.

References

APPENDIX: Rethinking Attention with Performers

A Hyperparameters for experiments

This optimal setting (including comparisons to approximate softmax) we use for the Performer is specified in the Generalized Attention (Subsec. A.4), and unless specifically mentioned (e.g. using name "Performer-SOFTMAX"), "Performer" refers to using this generalized attention setting.

We report the following evaluation metrics:

Accuracy: For unidirectional models, we measure the accuracy on next-token prediction, averaged across all sequence positions in the dataset. For bidirectional models, we mask each token with $15\%$ probability (same as (Devlin et al., 2018)) and measure accuracy across the masked positions.

Perplexity: For unidirectional models, we measure perplexity across all sequence positions in the dataset. For bidirectional models, similar to the accuracy case, we measure perplexity across the masked positions.

Bits Per Dimension/Character (BPD/BPC): This calculated by loss divided by $\ln(2)$ .

We used the full evaluation dataset for TrEMBL in the plots in the main section, while for other datasets such as ImageNet64 and PG-19 which have very large evaluation dataset sizes, we used random batches (>2048 samples) for plotting curves.

The PG-19 dataset (Rae et al., 2020) is presented as a challenging long range text modeling task. It consists of out-of-copyright Project Gutenberg books published before 1919. It does not have a fixed vocabulary size, instead opting for any tokenization which can model an arbitrary string of text. We use a unigram SentencePiece vocabulary (Kudo & Richardson, 2018) with 32768 tokens, which maintains whitespace and is completely invertible to the original book text. Perplexities are calculated as the average log-likelihood per token, multiplied by the ratio of the sentencepiece tokenization to number of tokens in the original dataset. The original dataset token count per split is: train=1973136207, validation=3007061, test=6966499. Our sentencepiece tokenization yields the following token counts per split: train=3084760726, valid=4656945, and test=10699704. This gives log likelihood multipliers of train=1.5634, valid=1.5487, test=1.5359 per split before computing perplexity, which is equal to $\exp(\text{log likelihood multiplier}*\text{loss})$ .

Preprocessing for TrEMBL is extensively explained in Appendix C.

A.2 Training Hyperparameters

Unless specifically stated, all Performer + Transformer runs by default used $0.5$ grad clip, $0.1$ weight decay, $0.1$ dropout, $10^{-3}$ fixed learning rate with Adam hyperparameters $(\beta_{1}=0.9,\beta_{2}=0.98,\epsilon=10^{-9})$ , with batch size maximized (until TPU memory overload) for a specific model.

All 36-layer protein experiments used the same amount of compute (i.e. 16x16 TPU-v2, 8GB per chip). For concatenated experiments, 16x16 TPU-v2’s were also used for the Performer, while 8x8’s were used for the 1-3 layer $(d=256)$ Transformer models (using 16x16 did not make a difference in accuracy).

Note that Performers are using the same training hyperparameters as Transformers, yet achieving competitive results - this shows that FAVOR can act as a simple drop-in without needing much tuning.

A.3 Approximate Softmax Attention Default Values

The optimal values, set to default parametershttps://github.com/google-research/google-research/blob/master/performer/fast_attention , are: renormalize_attention = True, numerical stabilizer = $10^{-6}$ , number of features = 256, ortho_features = True, ortho_scaling = 0.0.

A.4 Generalized Attention Default Values

The optimal values, set to default parametershttps://github.com/google-research/google-research/blob/master/performer/fast_attention , are: renormalize_attention = True, numerical stabilizer = 0.0, number of features = 256, kernel = ReLU, kernel_epsilon = $10^{-3}$ .

A.5 Reformer Default Values

For the Reformer, we used the same hyperparameters as mentioned for protein experiments, without gradient clipping, while using the defaultshttps://github.com/google/trax/blob/master/trax/supervised/configs/reformer_imagenet64.gin (which instead use learning rate decay) for ImageNet-64. In both cases, the Reformer used the same default LSH attention parameters.

A.6 Linformer Default Values

Using our standard pipeline as mentioned above, we replaced the attention function with the Linformer variant via Jax, with $\delta=10^{-6},k=600$ (same notation used in the paper (Wang et al., 2020)), where $\delta$ is the exponent in a renormalization procedure using $e^{-\delta}$ as a multiplier in order to approximate softmax, while $k$ is the dimension of the projections of the $\mathbf{Q}$ and $\mathbf{K}$ matrices. As a sanity check, we found that our Linformer implementation in Jax correctly approximated exact softmax’s output within $0.02$ error for all entries.

Note that for rigorous comparisons, our Linformer hyperparameters are even stronger than the defaults found in (Wang et al., 2020), as:

We use $k=600$ , which is more than twice than the default $k=256$ from the paper, and also twice than our default $m=256$ number of features.

We also use redrawing, which avoids "unlucky" projections on $\mathbf{Q}$ and $\mathbf{K}$ .

B Main Algorithm: FAVOR+

We outline the main algorithm for FAVOR+ formally:

We explain how our analysis from Section 2.2 can be extended to the unidirectional mechanism in this section. Notice that this time attention matrix $\mathbf{A}$ is masked, i.e. all its entries not in the lower-triangular part (which contains the diagonal) are zeroed (see also Fig. 8).

B.2 Orthogonal Random Features - Extensions

As mentioned in the main text, for isotropic $\Omega$ (true for most practical applications, including regular attention), instead of sampling $\omega_{i}$ independently, we can use orthogonal random features (ORF) (Yu et al., 2016; Choromanski et al., 2017; 2018b): these maintain the marginal distributions of samples $\omega_{i}$ while enforcing that different samples are orthogonal. If we need $m>d$ , ORFs still can be used locally within each $d\times d$ block of $\mathbf{W}$ (Yu et al., 2016).

ORFs were introduced to reduce the variance of Monte Carlo estimators (Yu et al., 2016; Choromanski et al., 2017; 2018b; 2019a; Rowland et al., 2019; Choromanski et al., 2018a; 2019b) and we showed in the theoretical and experimental sections from the main body that they do indeed lead to more accurate approximations and substantially better downstream results. There exist several variants of the ORF-mechanism and in the main body we discussed only the base one (that we refer to here as regular). Below we briefly review the most efficient ORF mechanisms (based on their strengths and costs) to present the most complete picture.

B.3 Time and Space Complexity - Detailed Analysis

We see that a variant of bidirectional FAVOR+ using iid samples or R-ORFs has $O(md+Ld+mL)$ space complexity as opposed to $\Theta(L^{2}+Ld)$ space complexity of the baseline. Unidirectional FAVOR+ using fast prefix-sum pre-computation in parallel (Ladner & Fischer, 1980; Cormen et al., 2009) has $O(mLd)$ space complexity to store $\mathbf{G}^{\textrm{PS}}$ which can be reduced to $O(md+Ld+mL)$ by running a simple (though non-parallel in $L$ ) aggregation of $\mathbf{G}^{\textrm{PS}}_{i,:,:}$ without storing the whole tensor $\mathbf{G}^{\textrm{PS}}$ in memory. From Subsec. B.2, we know that if instead we use G-ORFs, then space complexity is reduced to $O(m\log(d)+Ld+mL)$ and if the H-ORFs mechanism is used, then space is further reduced to $O(m+Ld+mL)=O(Ld+mL)$ . Thus for $m,d\ll L$ all our variants provide substantial space complexity improvements since they do not need to store the attention matrix explicitly.

The time complexity of Algorithm 1 is $O(Lmd)$ (note that constructing $\mathbf{Q}^{\prime}$ and $\mathbf{K}^{\prime}$ can be done in time $O(Lmd)$ ). Note that the time complexity of our method is much lower than $O(L^{2}d)$ of the baseline for $L\gg m$ .

As explained in Subsec. B.2, the R-ORF mechanism incurs an extra one-time $O(md^{2})$ cost (negligible compared to the $O(Lmd)$ term for $L\gg d$ ). H-ORFs or G-ORFs do not have this cost, and when FAVOR+ uses them, computing $\mathbf{Q}^{\prime}$ and $\mathbf{K}^{\prime}$ can be conducted in time $O(L\log(m)d)$ as opposed to $O(Lmd)$ (see: Subsec. B.2). Thus even though H/G-ORFs do not change the asymptotic time complexity, they improve the constant factor from the leading term. This might play an important role in training very large models.

The number of random features $m$ allows a trade-off between computational complexity and the level of approximation: bigger $m$ results in higher computation costs, but also in a lower variance of the estimate of $\mathbf{A}$ . In the theoretical section from the main body we showed that in practice we can take $M=\Theta(d\log(d))$ .

Observe that the FAVOR+ algorithm is highly-parallelizable, and benefits from fast matrix multiplication and broadcasted operations on GPUs or TPUs.

C Experimental Details for Protein Modeling Tasks

We used the TrEMBL datasethttps://www.uniprot.org/statistics/TrEMBL, which contains 139,394,261 sequences of which 106,030,080 are unique. While the training dataset appears smaller than the one used in Madani et al. (Madani et al., 2020), we argue that it includes most of the relevant sequences. Specifically, the TrEMBL dataset consists of the subset of UniProtKB sequences that have been computationally analyzed but not manually curated, and accounts for $\approx 99.5\%$ of the total number of sequences in the UniProtKB datasethttps://www.uniprot.org/uniprot/.

Following the methodology described in Madani et al. (Madani et al., 2020), we used both an OOD-Test set, where a selected subset of Pfam families are held-out for valuation, and an IID split, where the remaining protein sequences are split randomly into train, valid, and test tests. We held-out the following protein families (PF18369, PF04680, PF17988, PF12325, PF03272, PF03938, PF17724, PF10696, PF11968, PF04153, PF06173, PF12378, PF04420, PF10841, PF06917, PF03492, PF06905, PF15340, PF17055, PF05318), which resulted in 29,696 OOD sequences. We note that, due to deduplication and potential TrEMBL version mismatch, our OOD-Test set does not match exactly the one in Madani et al. (Madani et al., 2020). We also note that this OOD-Test selection methodology does not guarantee that the evaluation sequences are within a minimum distance from the sequences used during training. In future work, we will include rigorous distance based splits.

The statistics for the resulting dataset splits are reported in Table 1. In the standard sequence modeling task, given the length statistics that are reported in the table, we clip single sequences to maximum length $L=1024$ , which results in few sequences being truncated significantly.

In the long sequence task, the training and validation sets are obtained by concatenating the sequences, separated by an end-of-sequence token, and grouping the resulting chain into non-overlapping sequences of length $L=8192$ .

C.2 Empirical Baseline

A random baseline, with uniform probability across all the vocabulary tokens at every position, has accuracy $5\%$ (when including only the 20 standard amino acids) and $4\%$ (when also including the 5 anomalous amino acids (Consortium, 2019)). However, the empirical frequencies of the various amino acids in our dataset may be far from uniform, so we also consider an empirical baseline where the amino acid probabilities are proportional to their empirical frequencies in the training set.

Figure 9 shows the estimated empirical distribution. We use both the standard and anomalous amino acids, and we crop sequences to length 1024 to match the data processing performed for the Transformer models. The figure shows only the 20 standard amino acids, colored by their class, for comparison with the visualization on the TrEMBL web pagehttps://www.uniprot.org/statistics/TrEMBL.

C.3 Tabular Results

Table 2 contains the results on the single protein sequence modeling task ( $L=1024$ ). We report accuracy and perplexity as defined in Appendix A:

C.4 Attention Matrix Illustration

In this section we illustrate the attention matrices produced by a Performer model. We focus on the bidirectional case and choose one Performer model trained on the standard single-sequence TrEMBL task for over 500K steps. The same analysis can be applied to unidirectional Performers as well.

We note that while the Transformer model instantiates the attention matrix in order to compute the attention output that incorporates the (queries $Q$ , keys $K$ , values $V$ ) triplet (see Eq. 1 in the main paper), the FAVOR mechanism returns the attention output directly (see Algorithm 1). To account for this discrepancy, we extract the attention matrices by applying each attention mechanism twice: once on each original $(Q,K,V)$ triple to obtain the attention output, and once on a modified $(Q,K,V^{\circ})$ triple, where $V^{\circ}$ contains one-hot indicators for each position index, to obtain the attention matrix. The choice of $V^{\circ}$ ensures that the dimension of the attention output is equal to the sequence length, and that a non-zero output on a dimension $i$ can only arise from a non-zero attention weight to the $i^{th}$ sequence position. Indeed, in the Transformer case, when comparing the output of this procedure with the instantiated attention matrix, the outputs match.

Attention matrix example. We start by visualizing the attention matrix for an individual protein sequence. We use the BPT1_BOVIN protein sequencehttps://www.uniprot.org/uniprot/P00974, one of the most extensively studied globular proteins, which contains 100 amino acids. In Figure 10, we show the attention matrices for the first 4 layers. Note that many heads show a diagonal pattern, where each node attends to its neighbors, and some heads show a vertical pattern, where each head attends to the same fixed positions. These patterns are consistent with the patterns found in Transformer models trained on natural language (Kovaleva et al., 2019). In Figure 12 we highlight these attention patterns by focusing on the first 25 tokens, and in Figure 11, we illustrate in more detail two attention heads.

Amino acid similarity. Furthermore, we analyze the amino-acid similarity matrix estimated from the attention matrices produced by the Performer model, as described in Vig et al. (Vig et al., 2020). We aggregate the attention matrix across 800 sequences. The resulting similarity matrix is illustrated in Figure 13. Note that the Performer recognizes highly similar amino acid pairs such as (D, E) and (F, Y).

D Extended approximation and comparison results

Although mentioned previously (Sec. 4.2) that the Performer with additional finetuning is backwards compatible with the Transformer, we demonstrate below in Fig. 14 that error propagation due to non-attention components of the Transformer is one of the primary reasons that pretrained Transformer weights cannot be immediately used for inference on the corresponding Performer.

D.2 Approximate Softmax - Extended Properties

We show the following properties of our softmax approximation, in Fig. 15:

Redrawing: While the benefits of redrawing features was shown in Subsec. 4.3 of the main body of the paper, we also demonstrate its benefits when there are multiple layers with large scale (16x16 TPU-v2) training.

Unidirectional: While we have shown on TrEMBL that Performer with generalized ReLU attention outperforms softmax, we also show that approximate softmax attention can still be a solid choice, for example on ImageNet64 (U). After 100K steps of training, the Performer-ReLU, Performer-Softmax, and Performer-Softmax (SMREG) variants achieve respectively, 3.67, 3.69, 3.67 BPD.

Instability of Trigonometric Features: We see the full view of the unstable training curve when using Trigonometric softmax.

D.3 Generalized Attention

D.4 Comparison with Linear Transformer

We use the attention implementation of the Linear Transformer from (Katharopoulos et al., 2020), which mainly involves setting our feature map $\phi(x)=\text{elu}(x)+1$ , where $\text{elu}(x)$ is the shifted-eLU function from (Clevert et al., 2016).

For the sake of fairness and to prevent confounding results, while (Katharopoulos et al., 2020) also uses the GeLU nonlinearity for the MLPs in the Linear Transformer, we instead use the original ReLU nonlinearity. We also used the exact same training hyperparameters as Performer-ReLU on our exact ProGen setting from Fig. 6. Ultimately, we empirically found that the Linear Transformer possessed numerical instability during training via unstable training curves, ultimately stopping training by producing exploding gradients (NaNs) (Fig. 18).

D.5 Long Range Arena

Performers are compared against many additional (scalable and not scalable) methods not included in our paper: Local Attention, Sparse Attention, Longformer, Sinkhorn Transformer, Synthesizer, Big Bird and the aforementioned Linear Transformer on challenging long range context tasks in the Long Range Arena (Tay et al., 2021), with Fig. 19 displaying the original paper’s results. Performers obtain the largest LRA (Long Range Arena) score among all tested scalable Transformers methods (which we define by having speed of > 100 examples/sec).

Tasks used for comparison include: (1) a longer variation of the standard ListOps task proposed in (Nangia & Bowman, 2018), (2) byte-level text classification using real-world data, (3) byte-level document retrieval, (4) image classification on sequences of pixels, and (5) Pathfinder task (long-range spatial dependency problem). In the Long Range Arena paper, the authors found that all models do not learn anything on Path-X task (denoted by FAIL), contrary to the Pathfinder task, which shows that increasing the sequence length can cause seriously difficulties for model training.

E Computation costs - Extended results

In this subsection, we empirically measure computational costs in terms wall clock time on forward and backward passes for three scenarios in Fig. 20:

Performer, with varying number of layers. We show that our method can scale up to (but not necessarily limited to) even 20 layers.

Attention time complexities when comparing standard attention (from Transformer) and FAVOR (from Performer). Note that the maximum memory size here is not reflective of the maximum memory size in an actual model (shown below), as this benchmark requires computing explicit tensors (causing memory increases) in Jax, while a model does not.

Time complexities when comparing the Transformer and Performer models. "X" (OPT) denotes the maximum possible speedup achievable, when attention simply returns the $\mathbf{V}$ -vector, showing that the Performer is nearly optimal. We see that the maximum possible power of 2 length allowed on a V100 GPU (16GB) is $2^{15}=32768$ using regular dimensions.

Since some of the computational bottleneck in the Transformer may originate from the extra feed-forward layers (Kitaev et al., 2020), we also benchmark the “Small" version, i.e. $(n_{heads},n_{layers},d_{ff},d)=(1,6,64,64)$ as well, when the attention component is the dominant source of computation and memory. We remind the reader that the “Regular" version consists of $(n_{heads},n_{layers},d_{ff},d)=(8,6,2048,512)$ .

F Theoretical results

We provide here the proofs of all theoretical results presented in the paper.

F.2 Proof of Lemma 2

Denote: $\mathbf{z}=\mathbf{x}+\mathbf{y}$ and $\Delta=\mathbf{x}-\mathbf{y}$ . Note that by using standard trigonometric identities (and the fact that the variance of the sum of independent random variables is the sum of variances of those random variables), we can get the following for $\omega\sim\mathcal{N}(0,\mathbf{I}_{d})$ :

Using the fact that (see: Lemma 1 in (Yu et al., 2016); note that in that lemma they use notation: $z$ for what we denote as: $\|\Delta\|$ ):

The above immediately follows from the fact that positive random feature maps provide unbiased estimation of the softmax-kernel, thus the following is true:

where the last equality follows from Equation 16. Therefore we have:

F.3 Proof of Theorem 1

The proof of that fact can be found in the supplement of (Choromanski et al., 2018b), yet we provide it below for completeness and the convenience of the Reader:

Thus we conclude that: $F(k,d)=\frac{k-1}{d+1}F(k-2,d+2)$ . Therefore we have:

We again conduct partial integration and get:

where $w=\frac{\|\mathbf{z}\|^{2}}{2}$ and $f(k,d)=\frac{d^{k}}{(d+2k-2)(d+2k-4)\cdot...\cdot d}$ .

Note first that for $k\geq 1$ we have: $f(k,d)\leq 1$ , thus:

where the last equality is implied by the formula for the Laplace Transform for the Poisson random variable:

F.4 Proofs of Theorem 2,Theorem 3 & Beautiful Functions

We will provide here much more general theoretical results which will imply Theorem 3 and Theorem 2. We need the following definition:

Interestingly, beautiful functions can be used to define softmax and consequently, Gaussian kernels (both standard and regularized), leading to our PRF mechanism presented in the main body of the paper, as we explain below.

If one takes $\Omega=\mathcal{N}(0,\mathbf{I}_{d})$ (note that $\mathcal{N}(0,\mathbf{I}_{d})$ is isotropic) and $g:x\rightarrow\exp(x)$ (such $g$ is clearly entire with nonnegative power-series coefficient) then the following is true for $\mathbf{z}=\mathbf{x}+\mathbf{y}$ :

where $X=g(\mathbf{w}^{\top}\mathbf{z})$ , $\mathbf{w}\sim\mathcal{D}$ .

The above result provides us with exponentially small (in Legendre Transform) upper bounds on tail probabilities for the standard estimator. Below we provide our two main theoretical results.

If $F_{\Omega,g}$ is a beautiful function then the following holds for $m\leq d$ , $X$ as in Lemma 4 and any $a>F(\mathbf{z})$ , $\theta>0$ :

This result shows that features obtained from the ensembles of pairwise orthogonal random vectors provide exponentially small bounds on tail probabilities and that these bounds are strictly better than for estimators using unstructured features. Furthermore, the result is universal, i.e. holds for any dimensionality $d$ , not just asymptotically for $d$ large enough.

We also obtain similar result regarding mean squared errors (MSEs) of the considered estimators:

If $F_{\Omega,g}$ is a beautiful function then the following holds for $m\leq d$ :

As before, an orthogonal estimator leads to better concentration results and as before, this is the case for any $d>0$ , not only asymptotically for large enough $d$ .

Note that from what we have said above, Theorem 2 and Theorem 3 follow immediately from Theorem 6 and Theorem 5 respectively.

Thus in the remainder of this section we will prove Theorem 6 and Theorem 5.

F.4.2 Proof of Theorem 5

Note that by the analogous application of Markov’s Inequality as in Lemma 4, we get:

where $f(x)=\sum_{s=0}^{\infty}a_{s}x^{s}$ and $a_{0},a_{1},...\geq 0$ . Similarly,

for some ordered subsets of indices (with potentially repeating entries) $\mathcal{D}(j_{1},...,j_{m})$ and some nonnegative $\widehat{c}_{j_{1},\dots,j_{m}}(d_{1},\dots,d_{m})$ (exact formula for those can be given but we do not need it to complete the proof and since it is technical, it would unnecessarily complicate the proof so we skip it) and $\widehat{\Delta}(d_{1},...,d_{m})$ defined as:

Our next goal is to re-write the formula for $\widehat{\Delta}(d_{1},...,d_{m})$ . Denote:

Observe that $Y$ has the same distribution as $Y^{\prime}$ defined as:

where $\mathbf{v}\sim\frac{\mathbf{g}}{\|\mathbf{g}\|_{2}}$ .

Denote $\mathbf{g}=(g_{1},...,g_{d})^{\top}$ . Thus we obtain:

Now let us focus on the second expression from the formula on $\widehat{\Delta}(d_{1},...,d_{m})$ . We have:

Note that by Lemma 5, we can rewrite the right expression from the formula on $\widehat{\Delta}(d_{1},...,d_{m})$ as:

The left expression from the formula on $\widehat{\Delta}(d_{1},...,d_{m})$ can be rewritten as:

where $\tau(d_{1},...,d_{m})$ is defined as:

We need now few observations regarding $\widehat{\Delta}(d_{1},...,d_{m})$ . Note firsr that since odd moments of the Gaussian scalar distribution $\mathcal{N}(0,1)$ are zero, $\widehat{\Delta}(d_{1},...,d_{m})$ is zero if at least of of $d_{i}$ is odd. Furthermore, $\widehat{\Delta(d_{1},...,d_{m})}$ is trivially zero if all but at most one $d_{i}$ are zero.

With our new notation, $\Delta$ can be rewritten as:

Therefore (see: our observations on $\widehat{\Delta}(d_{1},...,d_{m})$ ) to complete the proof it suffices to show that: $\tau(d_{1},...,d_{m})\leq\frac{d}{d+2}$ if at least two: $d_{i}$ , $d_{j}$ for $i\neq j$ are nonzero and all $d_{i}$ are even.

The following holds if for some $i\neq j$ we have: $d_{i},d_{j}>0$ and all $d_{i}$ are even:

Note that $\tau(d_{1},...,d_{m})$ can be rewritten as:

where $\mu_{d}(j)$ stands for the $j^{th}$ moment of the $\chi$ -distribution with $d$ degrees of freedom. Note that $\mu_{d}(j)=2^{\frac{j}{2}}\frac{\Gamma(\frac{d+j}{2})}{\Gamma(\frac{d}{2})}$ , where $\Gamma$ is the so-called Gamma-function.

By $\mathcal{D}^{\prime}(j_{1},\dots,j_{m})$ denote a subset of $\mathcal{D}(j_{1},\dots,j_{m})$ formed by only keeping $d_{1},\dots,d_{m}$ such that for some $i\neq j$ , $d_{i},d_{j}>0$ and all $d_{i}$ are even. As we have shown above, $\widehat{\Delta}(d_{1},\dots,d_{m})=0$ when $(d_{1},\dots,d_{m})\notin\mathcal{D}^{\prime}(j_{1},\dots,j_{m})$ . Otherwise,

are nonnegative, we’ll get a lower bound on $\Delta$ by only taking a subset of these terms. For this subset, we take $j=4$ , a subset of $\mathcal{S}_{4}$ with only two nonzero $j_{k_{1}}=j_{k_{2}}=2$ for some $k_{1}\neq k_{2}$ (there are $\binom{m}{2}$ combinations of such $j_{1},\dots,j_{m}$ ). Then, we take only those $d_{1},\dots,d_{m}$ from $\mathcal{D}(j_{1},\dots,j_{m})$ which correspond to $s=1$ in (49) for $k_{1},k_{2}$ and $s=0$ for all other $k$ ’s. Hence, $d_{k_{1}}=d_{k_{2}}=2$ and all other $d_{k}$ ’s are zero and the corresponding weight from the second sum in (67) would be $a_{1}^{2}a_{0}^{m-2}$ . For $d_{1},\dots,d_{m}$ in such set, we’ll have $\tau(d_{1},\dots,d_{m})\leq\frac{d}{d+2}$ by Lemma 6 and, hence, $\widehat{\Delta}(d_{1},\dots,d_{m})\geq\frac{2}{d+2}\Lambda(d_{1},\dots,d_{m})$ . As the result, we get the following lower bound on $\Delta$ :

F.4.3 Proof of Theorem 6

for $\omega\sim\Omega$ and $r\sim\mathcal{N}(0,1)$ .

Hence, we can rewrite (74) by excluding terms which are definitely zero and using Lemma 6:

F.5 Proof of Theorem 4

We prove the more general version of Theorem 4 from the main body of the paper:

Taking $\delta=\frac{\epsilon}{g^{*}h^{*}}$ completes the proof. ∎

F.6 Discussion of Theorem 4

As a consequence of Theorem 4, the number $m$ of random projections required to approximate the attention matrix within $\epsilon$ error is a function of data dimensionality $d$ , the parameter $\epsilon$ and the radius $R$ of the ball within which the queries and keys live:

The dependence on $d$ and $\epsilon$ is fairly easy to understand: with a larger dimensionality $d$ we need more random projeections (on the order of magnitude $d\log(d)$ ) to get an approximation within $\epsilon$ error. The dependence on $R$ means that the length of queries and keys cannot grow at a fixed $m$ if we want to retain the quality of the approximation. In particular, this means that FAVOR cannot approximate hard attention on sequences of unlimited length with a fixed $m$ . When the sequence length increases, even the standard attention requires longer and longer vectors to make the softmax concentrated enough to pick single elements. Nevertheless, as seen in our experiments, this limitation does not manifest itself in practice at the lengths we experimented with.