Neural Architecture Search on Efficient Transformers and Beyond

Zexiang Liu, Dong Li, Kaiyue Lu, Zhen Qin, Weixuan Sun, Jiacheng Xu, Yiran Zhong

Introduction

Efficient Transformers have achieved remarkable advances in recent years. They reduce the quadratic computational complexity of the standard Transformer by spasifying or approximating Softmax attention in a more efficient fashion. Currently, the configuration of the efficient network, e.g., the number of heads and the input embedding size, is directly copied from the Transformer, which may not be suitable for efficient Transformers . The customized network structure specifically for efficient Transformers has not been well studied. However, the manual design always involves expensive engineering work and could be sub-optimal due to the human bias . Hence, in this paper, we aim to explore how to automatically find an appropriate architecture for efficient Transformers.

To achieve the goal, we need to (1) specify a suitable efficient Transformer as the target, and (2) find an approach to automating the network design. For the first point, we give a brief overview of existing efficient Transformers. In terms of how to treat the Softmax attention, they can be roughly classified into pattern-based and kernel-based . Pattern-based methods sparsify the attention matrix with predefined or learnable patterns, e.g., chunking input sequences into fixed blocks , calculating attention at fixed intervals , or using axial attention . Although the sparsity generated by specific patterns is beneficial for simplifying attention calculation, it still suffers from the quadratic complexity with respect to the input length $N$ and is complicated to implement. Differently, kernel-based methods aim at reducing the quadratic complexity into linear (i.e., $O(N)$ in both time and space complexity) . They reformulate the self-attention mechanism to avoid explicitly computing the $N\times N$ matrix. Moreover, they are easier to reproduce in practice. Considering the intriguing properties of kernel methods, we choose the cosFormer , a new kernel-based model with state-of-the-art performance among efficient Transformers, as our target. We attempt to seek for its customized and optimal architecture.

To automate the network design, we take advantage of neural architecture search (NAS) . It has been widely employed in searching standard Transformer architectures in Natural Language Processing and Computer Vision . These studies mainly focus on refining search space and/or improving search algorithms. For example, AutoAttend explores the effecitve connections between the Query, Key, and Value by creating primitive operations to the input data, e.g., $1\times 1$ convolution and $1\times 3$ max pooling. The evolved Transformer applies tournament selection to NAS to generate candidate models more robustly. However, these methods suffer from long training and large search costs because all the candidates need to be optimized, evaluated, and ranked. For the purpose of lowering these costs, we utilize RankNAS , a new efficient NAS framework for searching the standard Transformer . It can significantly speed up the search procedure through pairwise ranking, search space pruning, and the hardware-aware constraint.

Given both the efficient Transformer (i.e., cosFormer ) and the search algorithm (i.e., RankNAS ), we conduct a preliminary study on pure efficient Transformer-based NAS. Specifically, we replace Softmax with the linear attention introduced in the cosFormer, and then search it with RankNAS. To comprehensively investigate the optimal architecture, we perform the search on two representative tasks in NLP and CV fields, i.e., machine translation (WMT’14 En-De ) and image classification (CIFAR-10 ). Generally, we observe that the optimal structure of the cosFormer has fewer computational costs, e.g., smaller FLOPS as illustrated in Fig. 1. However, the general accuracy is less comparable to that of the standard Transformer (see the vertical axis in Fig. 1). This performance imbalance between accuracy and efficiency has also been revealed in other efficient Transformers .

Considering that the vanilla Softmax attention and linear attention have their own distinctions regarding the performance, we propose to use Softmax and linear attention in a mixed way for the benefit of a better balance between accuracy and efficiency (named as “mFormer”). Moreover, we expect the model to determine which type of attention to use automatically. To this end, we introduce a new search space specially for attention and incorporate it into the NAS framework. After re-searching the optimal architecture, we find that the combination of the two attention types achieves comparable performance to the Transformer with notably improved efficiency on both tasks (see Fig. 1).

In summary, we make three primary contributions:

To the best of our knowledge, it is the first work to search the optimal architecture of efficient Transformers. We employ NAS for searching the cosFormer, a representative efficient model. The searched results give new insights to the community, i.e., how to design customized, effective, and efficient network structures for efficient Transformers.

We propose a novel usage of attention, i.e., mixing Softmax attention and linear attention in the Transformer, and define a new search space for attention search in the NAS framework. This pushes existing Transformer search further by enabling automatic selection of appropriate attention types.

The proposed mixed attention achieves better balance between accuracy and efficiency, i.e., having comparable performance to the standard Transformer while maintaining good efficiency. This is validated on both machine translation and image classification tasks.

Related Work

The standard Transformer suffers from the quadratic time and space complexity caused by the Softmax attention. To reduce computational costs, efficient Transformers either sparsify or approximate Softmax in a more efficient fashion. Sparsifying the attention is adopted by the pattern-based methods, where the attention matrix is sparsified with predefined or learnable patterns. chunks the input sequence into fixed blocks, so that the complexity is reduced from $N^{2}$ to $B^{2}$ ( $B$ is the block size smaller than $N$ ). Alternatively, downsamples the input into fixed length. Instead of adjusting the sequence length, make use of strided/dilated patterns to compute attention at fixed intervals. Compared with these methods with one fixed pattern, the combination of multiple patterns can diversify the coverage of attention access . For example, calculates the attention along every axis of the input, and aggregates the attention from both strided and local patterns. To further improve the quality of patterns, learning patterns in a data-driven manner is becoming a new trend . The benefit over fixed patterns is that learnable ones are able to more precisely cluster input tokens based on token relevance while still maintaining the efficiency . In summary, pattern-based methods can enhance the efficiency by sparsifying the Softmax attention. However, they still have the quadratic complexity, and it will increase with the input length becoming larger. In addition to that, they are relatively complicated to reproduce in practice.

Another general category of efficient Transformers is based on kernels (or kernel functions), which aims to reduce the complexity from quadratic to linear. By this means, Softmax can be re-written with other forms to avoid the $N\times N$ attention matrix . To achieve this, assumes a low-rank prior within the $N\times N$ structure, and transforms the Key and Value into a lower dimension with extra projection layers. approximates Softmax with the production of a series of Gaussian kernels, and alternatively, uses Haar kernels instead. The PerFormer employs orthogonal random features to generate random kernels. The Linear Transformer makes use of the associative property of matrix products, and reformulates the similarity function with kernel decomposition. The cosFormer follows this kernelization strategy, and utilizes ReLU as the kernel function with an additional cosine reweighting mechanism. Generally, kernel-based methods can linearize the complexity and effectively enhance the efficiency. Moreover, they are easier to implement than pattern-based approaches.

2 Neural architecture search (NAS)

NAS aims to automatically find the most appropriate network architecture, and has been extensively applied to Computer Vision and Natural Language Processing . The core of NAS is to design a suitable search space, rank all candidate architectures generated from the space, and find the optimal structure. For Transformers in NLP, the Evolved Transformer (ET) is the pioneering work that applies NAS into Transformer architecture search. It defines a large-scale search space, which covers the components in the input, normalization, layers, the output dimension, activations, etc. An evolution algorithm is adopted to make architecture selection more stable and robust, i.e., consistently selecting the most promising architecture based on the fitness. Although ET can find a better and more efficient structure, its computational costs are still very large. This is because the algorithm has to cover all the search features. Besides, training the evolution process is also time-consuming.

Subsequent works focus on improving the search algorithm and/or pruning the search space. employs the differentiable architecture search (DARTS) that includes all the operations into a node (i.e., constituting a supernet) and leverages Softmax for the specific choice. This method relaxes the need of training each candidate network separately. decomposes the Transformer structure into smaller components, and introduces an one-shot search algorithm based on sampling. HAT introduces a hardware-aware constraint for accelerating the search process. RankNAS treats NAS as a pairwise ranking problem, which significantly speeds up the search process.

Pruning the search space, i.e., only retaining the most important search features, is another approach to reducing the search costs. TextNAS specifies a tailored search space for text representation, which consists of convolutional, recurrent, pooling, and self-attention layers. The Primer modifies the search space with squaring ReLU activations and a depth-wise convolutional layers after the Query, Key, and Value. AutoAttend specially searches the connections between QKV. RankNAS proposes a feature selection algorithm that measures the importance of each feature and sorts out the most useful ones for further search. In summary, search space pruning can effectively improve the search efficiency and maintain good performance.

NAS on Efficient Transformers

In this section, we first give a brief review of the preliminary knowledge of the cosFormer and RankNAS . After that, we specify the main steps of applying NAS to the efficient Transformer search and discuss the general results. Finally, we introduce the proposed idea of using mixed attention.

Linearizing attention can reduce the complexity from quadratic to linear . This is achieved by decomposing the similarity function $\rho$ and multiplying $\mathbf{K}$ and $\mathbf{V}$ first, i.e.,

where $\varphi$ is a kernel function that transforms $\mathbf{Q}$ and $\mathbf{K}$ into hidden representations. In this case, the complexity $O(N^{2}d)$ lowers to $O(Nd^{2})$ . Generally we always have $N\gg d$ , so $O(N^{2}d)\approx O(N^{2})$ and $O(Nd^{2})\approx O(N)$ , yielding the linear complexity.

Here for simplicity, we ignore the normalization term. Eq. 3 can be re-written as the linear form in Eq. 2, and we refer readers to the original paper for more details.

1.2 RankNAS

Conventional NAS methods have to evaluate the performance of every candidate network for ranking, which is time-consuming especially when the search space is large. The recently-proposed RankNAS can effectively reduce training costs with the following three distinctions:

Pairwise ranking. Instead of sorting all candidate architectures, RankNAS treats NAS as a pairwise ranking problem. That is, a binary classification is conducted to each candidate pair, and the label is simplified into correctly ordered or incorrectly ordered as per their estimated performance.

Search space pruning. RankNAS prunes the search space by only containing the most important features that have large impact to the performance.

Hardware-aware search. In addition to the standard search based on losses, RankNAS also leverages a hardware constraint, i.e., estimating the latency and discarding the fastest and slowest 10% models.

Considering the good efficiency of RankNAS, we take it as our NAS framework. More technical details can be found in .

2 RankNAS on cosFormer

Our primitive search space is adopted from RankNAS, which contains fundamental features including the embedding size, the number of encoder/decoder layers, the number of heads, and the dimension of the feed-forward network (see Fig. 4). We also make corresponding modification to the space as per each task. Detailed definition is introduced in Section LABEL:sec:setting.

We replace all Softmax with the cosFormer attention, and follow RankNAS for the search process. A supernet containing all possible architectures is firstly trained. Afterwards, the loss and latency data are separately collected and ranked. Based on the loss and latency constraints, the most important features are selected, i.e., search space pruning. The evolution algorithm is performed on the refined search space, and the optimal architecture is sorted out. Finally, we re-train the optimal network from scratch.

2.3 Discussion

For comparison, we also repeat the above steps to the standard Transformer only with the Softmax attention. For simplicity, we denote the optimal architecture of the cosFormer as “cosFormer $*$ ”, and that of the Transformer as “Transformer $*$ ”. From Fig. 1, we find that on both tasks, the cosFormer $*$ presents better efficiency than the Transformer $*$ (smaller FlOPS) with a comparable parameter scale, but the general accuracy is less competitive. We name this phenomenon as the imbalance between accuracy and efficiency, which has also been observed in other efficient Transformers .

In terms of the attention itself, the advantage of Softmax in accuracy is associated with its ability in “imposing a categorical distribution constraint on the query-context relevance scores” . This constraint has two essential properties: (1) dense coverage, i.e., attending every feature to the query ; and (2) non-linear concentration, i.e., larger weights are only assigned to more relevant features . In spite of the useful properties, we should also notice that the primary limitation of Softmax is its high computational complexity, as analyzed in Section 3.1.

2.2 Accuracy

On the WMT’14 En-De test set, the cosFormer $*$ presents extremely poor BLEUIt has the inappropriate convergence, and we are communicating with the authors of for dealing with this issue.. It reflects the limitation of the linear attention in achieving comparable accuracy to the Transformer $*$ .

We also investigate the importance of each feature to the accuracy, which is measured by RankNAS . The top-5 ranked features are:

cosFormer $*$ : En-De Connect, Dec Layer Num, En-De Head Num, Dec FFN Dim, Dec Head Num.

Transformer $*$ : Dec Layer Num, En-De Head Num, Dec FFN Dim, Dec Head Num, Enc FFN Dim.

Hence, when designing the architecture for the efficient Transformer, we need to focus more on the encoder-decoder interaction, e.g., the number of encoder layers to attend and the cross-attention heads. Both the cosFormer $*$ and Transformer $*$ are also sensitive to the decoder layer number, but the best performance does not necessarily need the largest number of layers.

2.3 Efficiency

Clearly, the cosFormer $*$ is more efficient, i.e., its FLOPS is 8.32G while that of the Transformer $*$ is 9.86G. This difference is mainly due to the more complex optimal structure of the Transformer. The Softmax attention also increases the computational costs, which will be detailed in Section 4.5. The top-5 ranked features regarding the efficiency are:

cosFormer $*$ : Dec Layer Num, En-De Head Num, En-De Connect, Enc FFN Dim, Dec Head Num.

Transformer $*$ : Dec Layer Num, Dec Head Num, Enc Head Num, Enc FFN Dim, Dec FFN Dim.

It is obvious that the number of decoder layers has the largest impact to the efficiency. For the efficient Transformer, the encoder-decoder interaction also plays an essential role in determining the computational costs. By contrast, the Transformer $*$ ’s efficiency is less sensitive to the cross interaction.

2.4 Summary

We summarize empirical hints to design an appropriate architecture for the efficient Transformer on machine translation: (1) Fewer decoder layers; (2) Focusing more on the encoder-decoder interaction. The encoder layers to attend and the number of encoder-decoder heads are not necessarily too large; (3) The encoder FFN dimension is more important to the efficiency while the decoder FFN dimension is more important to the accuracy.

3 Optimal architecture comparison on image classification

The optimal structures of the cosFormer and Transformer significantly outperform their vanilla versions in accuracy (see Table 5, which further verifies the necessity and usefulness of NAS. The FLOPS is larger because vanilla models only have 6 encoder layers. We compare optimal structures in the following.

Table 4 shows the optimal architectures of the cosFormer and Transformer . Clearly, the Transformer $*$ has more layers, a smaller embedding size, and a smaller averaged FFN dimension than the cosFormer $*$ . From a general view, the cosFormer prefers a “shallow and wide” structure while the optimal architecture of the Transformer tends to be “deep and thin”.

FFN dimension. Fig. 3.2.1(a) displays the FFN dimensions in different layers. From the tendency curve, we observe that the cosFormer $*$ has a relatively smaller FFN dimension in intermediate layers. Differently, the dimension is smaller in early layers in the Transformer $*$ .

Head number. The head numbers in encoder layers are plotted in Fig. 3.2.1(b). The cosFormer $*$ generally has more heads in early and top layers. By contrast, the head numbers in the Transformer $*$ do not vary too much and their values are large.

Parameters. The cosFormer $*$ and the Transformer $*$ have a similar parameter scale, i.e., $\sim$ 24M.

3.2 Accuracy

On the CIFAR-10 test set, the cosFormer $*$ achieves the 88.4% accuracy, which is surpassed by the Transformer $*$ (95.10%) by around 7.6%.

According to the importance to the accuracy, the features ranked by RankNAS are:

cosFormer $*$ : Layer Num, FFN Dim, Emb Dim, Head Num.

Transformer $*$ : FFN Dim, Head Num, Emb Dim, Layer Num.

Clearly, the number of layers has the largest impact to the accuracy of the cosFormer. Similar to the case in machine translation, the cosFormer $*$ does not select the greatest layer number.

3.3 Efficiency

The FLOPS of the cosFormer $*$ is 7.01G, which is significantly better than the Transformer $*$ (10.9G) by 36%. It reveals the advantage of the cosFormer’s linear attention in efficiency. The important features regarding the efficiency are:

cosFormer $*$ : Layer Num, Emb Dim, Head Num, FFN Dim.

Transformer $*$ : Layer Num, FFN Dim, Head Num, Emb Dim.

Similar to the conclusion in machine translation, the layer number in image classification is also the most related to the efficiency. The efficient model has the preference of using fewer layers, which further reduces the computational burden.

3.4 Summary

We give a brief summary of how to appropriately design the efficient Transformer in image classification: (1) Fewer layers; (2) A smaller FFN dimension in intermediate layers with a slightly larger embedding size; (3) A smaller head number in intermediate layers.

4 Our results using mixed attention

Table LABEL:tab:mt_search shows the optimal structure and the performance of our mFormer. In general, the mFormer is “shallow and thin”, which has the fewest parameters and smallest FLOPS among the three optimal models. The linear attention exists in Layer 2 and 4 in the encoder, and all the other encoder/decoder layers retain the Softmax attention. In mFormer, the FFN dimensions in the encoder undergo an “up-and-down” change, i.e., early layers rely on smaller FFN sizes. The number of heads in the encoder tends to be smaller in intermediate layers. The primary features in the decoder of the mFormer consistently have the no larger quantity than the other two, except that the head number in the first decoder layer is larger than that of the cosFormer $*$ .

In terms of the performance, the mFormer achieves comparable BLEU to the Transformer $*$ with 18% fewer parameters and 18% smaller FLOPS. It also significantly outperforms the cosFormer $*$ in accuracy with slightly better efficiency. The results indicate that the mixed use of attention can effectively facilitate the reduction of the imbalance between accuracy and efficiency.

4.2 Image classification

We display the optimal architecture of the proposed mFormer in Table 4. Intuitively, our structure is ”shallow and thin”, which is consistent with the observation in machine translation. This is also reflected by the parameter scale, i.e., we have around 20% fewer parameters than the other two. The linear attention exists in Layer 3, 7 and 11 (accounting for 25%), and the remaining attention types in other layers are all Softmax. The FFN dimensions are large in early layers, and has a “down-up-down” tendency in subsequent layers. The changes of head numbers seem to be inversely proportional to the FFN dimension, i.e., larger FFN dimensions normally correspond to smaller head numbers.

Remarkably, the mFormer outperforms the cosFormer $*$ to a large margin in accuracy (by 5.19 in percentage) while maintaining good efficiency, i.e., it only has 1.38G more FLOPS than the cosFormer $*$ . Moreover, it achieves comparable performance to the Transformer $*$ in accuracy (the latter surpasses ours only by 1.51 in percentage) but is significantly more efficient (our FLOPS is smaller than the Transformer $*$ by 2.51G, which is $\sim$ 23%). It demonstrates that the use of mixed attention can achieve a better balance between accuracy and efficiency on the image classification task.

5 Ablation study on attention types

Since the attention search is newly introduced in the NAS framework, we specially study the influence of each attention type in optimal architectures. This is to validate (1) the effectiveness of the proposed mixed attention in balancing accuracy and efficiency, and (2) the inherent difference between Softmax and linear attention in performance. The results on machine translation and image classification are reported in Table 5.

We first uniform the mixed attention and linear attention in the optimal architectures of the mFormer and cosFormer $*$ with Softmax. For our mFormer, the accuracy is slightly improved but the FLOPS becomes larger, indicating that the better accuracy is always along with the sacrifice in efficiency. This performance imbalance is more obvious when we replace all the linear attention in the cosFormer $*$ with Softmax, where the accuracy is significantly enhanced with much degraded computational efficiency. Notably, our model is able to achieve comparable accuracy to the performance with all Softmax and keep satisfactory efficiency without dropping too much.

5.2 All linear attention

In this step, we replace the mixed attention in the mFormer and Softmax attention in the Transformer $*$ uniformly with the linear attention. We notice a significant performance drop in accuracy on both tasks. It demonstrates the less competitive ability of the linear attention in yielding accurate results. However, after the replacement, the efficiency gain is remarkable, which further indicates the advantage of the efficient Transformer in reducing computational costs. The mixed use of attention is beneficial for achieving both comparable accuracy and efficiency.

Conclusion

In this paper, we utilize NAS to find the optimal architecture of efficient Transformers (i.e., the cosFormer ). The searched architectures reveal that the optimal structures of the efficient Transformer are relatively lighter than those of the standard Transformer, e.g., the reduced parameter scale and improved FLOPS. This provides useful insights to the community for the appropriate design of efficient Transformers. However, the general accuracy of efficient models is less competitive. Based on this observation, we propose a novel usage of attention, i.e., employing the linear attention and Softmax in a mixed manner in each layer. The searched optimal architecture presents comparable accuracy to the Transformer and maintains as good efficiency as the efficient Transformer. The proposed method supplies a new direction in the Transformer study, i.e., taking advantage of both Softmax and linear attention. In our future work, we will study the mixed attention on large-scale pretrain models as well as other downstream tasks.

References

Training Details

We specify training details of machine translation and image classification tasks below. For those not listed, we use the default setting in RankNAS .

2 Image classification

Feature Importance in mFormer

With RankNAS , we can know the importance of each feature when searching the proposed mFormer.

The top 5 features regarding the accuracy are: Dec Layer Num, En-De Head Num, En-De Head Num, Enc Attn Type, and Dec FFN Dim. Similar to the Transformer $*$ , the most important feature is the number of decoder layers. In our optimal structure, this number is 2, which indicates that the best performance does not necessarily need many layers. Besides, the encoder-decoder interaction also has large impact to the accuracy, which requires careful design. The proposed attention type is important as well, further validating the effectiveness of attention search.

The top 5 features regarding the efficiency are: Dec Layer Num, Dec Head Num, En-De Connect, En-De Head Num, and Enc FFN Dim. Again, the number of decoder layers has the largest impact to the efficiency. Note that the proposed attention type is not contained here, which is because it only exists in encoders with 6 layers and the efficiency gain is thus not very large.

2 Image classification

The top 5 features regarding the accuracy are: Enc Attn Type, Enc Layer Num, Enc FFN Dim, Enc Head Num, and Enc Embed Dim. The top 5 features regarding the efficiency are: Enc Layer Num, Enc Attn Type, Enc Embed Dim, Enc Head Num, and Enc FFN Dim.

Clearly, on this task, our attention type plays a crucial role in both accuracy and efficiency, which well verifies the effectiveness of the mixed use of attention.

Linear Attention in Decoders

On the machine translation task, when we put the linear attention in decoder layers, the BLEU values are all 0 (corresponding to three cases: 1) only decoder self attention, 2) only encoder-decoder attention, and 3) both). We also find that the output sentences have the “repeater” problem, i.e., repeating a single word for many times. We have not found the specific reason for this issue, and we are communicating it with the authors in . We infer that the potential reason is the locality in the cosFormer attention may be more suitable for feature fusion in encoders.