An Empirical Study of Spatial Attention Mechanisms in Deep Networks

Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, Jifeng Dai

Introduction

Attention mechanisms enable a neural network to focus more on relevant elements of the input than on irrelevant parts. They were first studied in natural language processing (NLP), where encoder-decoder attention modules were developed to facilitate neural machine translation . In computing the output for a given query element (e.g., a target word in the output sentence), certain key elements (e.g., source words in the input sentence) are prioritized according to the query. Later, self-attention modules were presented for modeling intra-sentence relations , where both the key and query are from the same set of elements. In a milestone paper , the Transformer attention module is presented, superseding past works and substantially surpassing their performance. The success of attention modeling in NLP has led to its adoption in computer vision, where different variants of Transformer attention are applied to recognition tasks such as object detection and semantic segmentation , where the query and key are visual elements such as image pixels or regions of interest.

In determining the attention weight assigned to a certain key for a given query, there exist just a few properties of the input that are commonly considered. One is the content of the query. For the case of self-attention, the query content may be the features at the query pixel in an image, or of a word in a sentence. Another is the content of the key, where a key may be a pixel within a local neighborhood of the query, or another word within the sentence. The third is the relative position of the query and key.

Based on these input properties, there are four possible attention factors from which the attention weight for a key with respect to a query is determined, as these factors must account for information about the key. Specifically, these factors are (1) the query and key content, (2) the query content and relative position, (3) the key content only, and (4) the relative position only. In the latest version of Transformer attention , attention weights are expressed as a sum of four terms ( $\mathcal{E}_{1}$ , $\mathcal{E}_{2}$ , $\mathcal{E}_{3}$ , $\mathcal{E}_{4}$ ), one for each of these attention factors as illustrated in Fig. 1. The nature of the dependencies involved with these terms vary. For example, the first two ( $\mathcal{E}_{1}$ , $\mathcal{E}_{2}$ ) are sensitive to the query content. While, the latter two ( $\mathcal{E}_{3}$ , $\mathcal{E}_{4}$ ) do not account for query content, but rather they mainly capture salient key elements and exploit global positional biases, respectively. Although attention weights can be decomposed into terms based on these factors, their relative significance in various inference problems has not been closely examined in the literature. Moreover, prevalent modules like deformable convolution and dynamic convolution , though seemingly orthogonal to Transformer attention, also employ mechanisms that focus on certain parts of an input. Whether these modules can all be viewed from a unified perspective and how their operational mechanisms differ also have not been explored.

In this work, we perceive Transformer attention, deformable convolution, and dynamic convolution modules as various instantiations of spatial attention, involving different subsets of the attention factors and accounting for these factors in different ways. Towards disentangling the effects of different attention factors and mechanisms, we present an empirical study of spatial attention, in which various elements of attention mechanisms are ablated within a generalized attention formulation. This investigation is conducted on a variety of applications, namely neural machine translation, semantic segmentation, and object detection. From this study, we find that: 1) In the Transformer attention module, the query-sensitive terms, especially the query and key content term, play a minor role in self-attention. But in encoder-decoder attention, the query and key content term is vital. 2) Though deformable convolution utilizes an attention mechanism based only on the query content and relative position term, it operates more effectively and efficiently on image recognition than the counterpart in Transformer attention. 3) In self-attention, the factors of query content & relative position and key content only are the most important. A proper combination of deformable convolution and the key content only term in Transformer attention delivers higher accuracy than that of the Transformer attention module, with much lower computational overhead on image recognition tasks.

The observations made in this paper challenge the conventional understanding of current spatial attention mechanisms. For example, it is widely believed that their success can mainly be attributed to query-sensitive attention, especially the query and key content term. This understanding perhaps originates from the initial success of the encoder-decoder attention module in neural machine translation. Thus, in some recent variants , like the non-local block and criss-cross attention module , only the query and key content term is kept, with all the other terms removed. These modules still function well in self-attention applications, which strengthen this perception. However, our study suggests that this understanding is incorrect. We find that these attention modules with only query-sensitive terms actually perform on par with those with only query-irrelevant terms. Our study further suggests that this degeneration is likely due to the design of the attention modules, rather than an inherent characteristic of self-attention, since deformable convolution is found to exploit query content & relative position effectively and efficiently in image recognition tasks.

This empirical analysis suggests that there is much room for improvement in the design of spatial attention mechanisms in deep networks. Its findings are used in this paper to make some initial headway in this direction, and it is hoped that this study will spur further investigation into the operational mechanisms used in modeling spatial attention.

Related Work

Development and application of attention-based modules. The field of NLP has witnessed steady development of attention mechanisms in recent years . Starting from the introduction of an attention module in neural machine translation , various attention factors and weight assignment functions based on these factors have been utilized. In , the inner product of vectors encoding query and key contents is recommended for computing attention weights, and absolute spatial positions are incorporated as an attention factor. In , the weight assignment additionally accounts for the inner product of spatial positions encoded in high-dimensional vectors. The landmark work of Transformer set a new standard, and its latest variants use relative positions instead of absolute positions for better generalization ability . In this paper, we conduct the empirical study on the latest instantiation of Transformer attention from this family of works.

Motivated by their success in NLP tasks , attention mechanisms have also been employed in computer vision applications such as relational reasoning among objects , image captioning , image generation , image recognition , and video recognition . In vision, the key and query refer to visual elements, but aside from that, most of these works use a formulation similar to Transformer attention. Since the effects of different attention module elements may vary with the target application, we conduct the empirical study on three different tasks that have been influenced greatly by attention modeling, namely neural machine translation in NLP, and object detection and semantic segmentation in computer vision.

Aside from Transformer attention, there are variants of convolution, such as deformable convolution and dynamic convolution , that also can be viewed as types of attention mechanisms which operate on a subset of the attention factors using different attention weight functions. They also are included in the study for examination.

It is worth mentioning a dual form of spatial attention, called channel-wise feature attention . As different feature channels encode different semantic concepts, these works seek to capture the correlations among these concepts through activation/deactivation of certain channels. Meanwhile, in the spatial domain, relationships among elements at different spatial positions are modeled, with the same attention weights on feature channels assigned to related spatial positions. The development of channel-wise feature attention has been focused on certain image recognition tasks, like semantic segmentation and image classification. In this paper, our empirical study specifically examines spatial attention mechanisms designed for broad application.

Analysis of spatial attention mechanisms. There exists relatively little analysis of spatial attention mechanisms despite their prevalence in deep networks. This research has largely been conducted by visualizing or analyzing the learned attention weights of a whole attention module on only NLP tasks . Many works suggest that attention weight assignment in encoder-decoder attention plays a role similar to word alignment in traditional approaches . The implicit underlying assumption in these works is that the input elements accorded high attention weights are responsible for the model outputs. However, recent research casts doubt on this assumption , finding that attention weights do not correlate well with feature importance measures, and that counterfactual attention weight configurations do not yield corresponding changes in prediction.

In this paper, we conduct the first comprehensive empirical study on the elements of spatial attention modules over both NLP and computer vision tasks. Different attention factors and weight assignment functions are carefully disentangled, with their effects directly measured by the final performance on these tasks.

Study of Spatial Attention Mechanisms

To facilitate our study, we develop a generalized attention formulation that is able to represent various module designs. We then show how the dominant attention mechanisms can be represented within this formulation, and how ablations can be conducted using this formulation with respect to different attention module elements.

Given a query element and a set of key elements, an attention function adaptively aggregates the key contents according to attention weights that measure the compatibility of query-key pairs. To allow the model to attend to key contents from different representation subspaces and different positions, the outputs of multiple attention functions (heads) are linearly aggregated with learnable weights. Let $q$ index a query element with content $z_{q}$ , and $k$ index a key element with content $x_{k}$ . Then the multi-head attention feature $y_{q}$ is computed as

where $m$ indexes the attention head, $\Omega_{q}$ specifies the supporting key region for the query, $A_{m}(q,k,z_{q},x_{k})$ denotes the attention weights in the $m$ -th attention head, and $W_{m}$ and $W_{m}^{\prime}$ are learnable weights. Usually, the attention weights are normalized within $\Omega_{q}$ , as $\sum_{k\in\Omega_{q}}A_{m}(q,k,z_{q},x_{k})=1$ .

In encoder-decoder attention, the key and the query are from two different sets of elements, where in most applications the two sets of elements need to be properly aligned. For example, in the encoder-decoder attention of neural machine translation, the key and the query elements correspond to the words in the input and the output sentences, respectively, where proper alignment is necessary for correct translation. Meanwhile, in self-attention, the key and the query are from the same set of elements. For example, both the key and the query are of words in the input or output sentence. In such scenarios, the self-attention mechanism is expected to capture intra-relationships among the elements, and usually the query and the key contents are modeled by the same set of features, i.e., $x=z$ .

In the most recent instantiation of the Transformer attention module , the attention weight of each query-key pair is computed as the sum of four terms $\{\mathcal{E}_{j}\}_{j=1}^{4}$ that are based on different attention factors, as

normalized by $\sum_{k\in\Omega_{q}}A^{\text{Trans}}_{m}(q,k,z_{q},x_{k})=1$ where the supporting key region $\Omega_{q}$ spans the key elements (e.g., the whole input sentence). By default, $8$ attentional heads are utilized in this paper.

The $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$ terms are sensitive to the query content. The $\mathcal{E}_{1}$ term measures the compatibility of the query and key content, as $\mathcal{E}_{1}=z_{q}^{\top}U_{m}^{\top}V^{\rm C}_{m}x_{k}$ , where $U_{m}$ , $V^{\rm C}_{m}$ are learnable embedding matrices for the query and key content, respectively. It enables the network to focus more on the keys compatible with the query in terms of content. A possible outcome is the correspondence between similar query and key elements, as illustrated in Fig. 1 (a). For the $\mathcal{E}_{2}$ term, it is based on the query content and relative position, as $\mathcal{E}_{2}=z_{q}^{\top}U_{m}^{\top}V^{\rm R}_{m}R_{k-q}$ , where $R_{k-q}$ encodes the relative position $k-q$ by projecting it to a high-dimensional representation through computing sine and cosine functions of different wavelengthsFor 2-d image data, we separately encode the x-axis relative position $R^{\text{X}}_{k-q}$ and y-axis relative position $R^{\text{Y}}_{k-q}$ , and concatenate them to be the final encoding $R_{k-q}=[R^{\text{X}}_{k-q},R^{\text{Y}}_{k-q}]$ . . $V^{\rm R}_{m}$ is a learnable embedding matrix for the encoded relative position $R_{k-q}$ . This term allows the network to adaptively determine where to assign high attention weights based on the query content. It may help to disentangle appearance from spatial transformations in image recognition, as illustrated in Fig. 1 (b).

The $\mathcal{E}_{3}$ and $\mathcal{E}_{4}$ terms are irrelevant to the query content. The $\mathcal{E}_{3}$ term involves key content only, as $\mathcal{E}_{3}=u_{m}^{\top}V^{\rm C}_{m}{x_{k}}$ , where $u_{m}$ is a learnable vector. It captures salient key content which should be focused on for the task, and is irrelevant to the query. An illustration is shown in Fig. 1 (c). As for the $\mathcal{E}_{4}$ term, it involves relative position only, as $\mathcal{E}_{4}=v_{m}^{\top}V^{\rm R}_{m}{R_{k-q}}$ , where $v_{m}$ is a learnable vector. It captures global positional bias between the key and query elements, as illustrated in Fig. 1 (d).

It is widely believed that query-sensitive prioritization, especially the query and key content compatibility term $\mathcal{E}_{1}$ , is the key to the success of Transformer attention. Thus, in some recent variants , only $\mathcal{E}_{1}$ is kept, while the other terms are all removed.

In Transformer attention, both $W_{m}$ and $W^{\prime}_{m}$ in Eq. (1) are learnable. $W^{\prime}_{m}$ projects the features of $x_{k}$ to a relatively low dimension for reducing computational overhead, and $W_{m}$ projects the aggregated features back to the same dimension as $y_{q}$ .

Regular and deformable convolution can be deemed as special instantiations of spatial attention mechanisms, where subsets of the attention factors are involved.

In regular convolution, given a query element, a fixed number of key elements (e.g., $3\times 3$ ) are sampled, according to predetermined positional offsets with respect to the query. From the perspective of Eq. (1), the attention weight of regular convolution can be expressed as

where each sampled key element is of a separate attention head (e.g., $3\times 3$ regular convolution corresponds to 9 attention heads), and $p_{m}$ denotes the offset for the $m$ -th sampling position. In addition, the weight $W_{m}^{\prime}$ in Eq. (1) is fixed as identity, leaving $W_{m}$ as learnable. In regular convolution, only relative position is involved, without learnable parameters for adapting attention to content. The supporting key region $\Omega_{q}$ is restricted to a local window centered at the query position and determined by the convolution kernel size.

In deformable convolution , learnable offsets are added to adjust the sampling positions of the key elements, so as to capture spatial transformations. The learnable offsets are predicted based on the query content, and are thus dynamic to the input. The key and the query elements are from the same set. It can also be incorporated into the generalized attention formulation as a special instantiation of self-attention, where the attention weight is

where $p_{m}$ also denotes a predetermined offset, and $w_{m}^{\top}x_{q}$ projects the query content $x_{q}$ to a deformation offset according to a learnable vector $w_{m}$ Following , the learning rate of $w_{m}$ is set to 0.1 times that of other parameters to stabilize training.. $G(a,b)$ is the bilinear interpolation kernel in $N$ -d space, which can be decomposed into 1-d bilinear interpolations as $G(a,b)=\prod_{n=1}^{N}g(a_{n},b_{n})$ , where $a_{n}$ and $b_{n}$ denote the $n$ -th dimension of $a$ and $b$ respectively, and $g(a_{n},b_{n})=\max(0,1-|a_{n}-b_{n}|)$ . Similar to regular convolution, the weight $W_{m}^{\prime}$ in Eq. (1) is fixed as identity.

In deformable convolution, the attention factors are query content and relative position. The supporting key region $\Omega_{q}$ can span over all the input elements due to the introduced learnable offsets, while non-zero weights are assigned to a sparse set of key elements where bilinear interpolation is performed.

Dynamic convolution is recently proposed to replace the Transformer attention module in self-attention, and is claimed to be simpler and more efficient. It is built upon depth-wise separable convolution with shared dynamic kernel weights, which are predicted based on the query content. In depth-wise separable convolution, a standard convolution is factorized into a depth-wise convolution and a $1\times 1$ convolution called a point-wise convolution, for reducing computation and model size. In depth-wise convolution, a single filter is applied to each input channel, which is fixed for all positions. In dynamic convolution, the kernel weights for the depth-wise convolution are dynamically predicted from the input features, followed by a Softmax normalization. For computational savings, the input channels are divided into several groups, where each group shares the same dynamic kernel weights. In the system of , an orthogonal module called the gated linear unit (GLU) is applied before the dynamic convolution module to improve accuracy. We include the GLU to respect the original design.

Dynamic convolution can also be incorporated into the general attention formulation in Eq. (1) with minor modifications, where each input feature channel is of a separate attention head. It can be expressed as

where $c$ enumerates the channels of the input features ( $C_{\text{in}}$ channels in total), $x_{k,c}$ denotes the feature value at the $c$ -th channel of $x_{k}$ , and $W_{c}$ is of the $1\times 1$ point-wise convolution. $A^{\text{dynamic}}_{c}(q,k,x_{q})$ is the attention weight specified by the dynamic kernel in depth-wise convolution, written as

where $p_{j}$ denotes the $j$ -th sampling position in the dynamic kernel, and $K_{j,c}$ is the corresponding kernel weight. Zero attention weight is assigned to keys outside of the kernel. The kernel weight $K_{j,c}$ is predicted from the input features, and is shared among channels in the same group, as

The input features are divided into $N_{g}$ groups ( $N_{g}=16$ by default). $K_{j,g}^{\text{share}}$ denotes the dynamic kernel weight for the $g$ -th group, and $d_{j,g}$ is the corresponding learnable weight vector. $K_{j,g}^{\text{share}}$ is normalized by $\sum_{j=1}^{N_{k}}K_{j,g}^{\text{share}}=1$ , where $N_{k}$ denotes the number of elements in the dynamic kernel.

In dynamic convolution, attention assignment is based on the query content and relative position factor. The supporting key region $\Omega_{q}$ is restricted to a local window around the query position covered by the dynamic kernel.

Tab. 1 compares the three attention mechanisms discussed above. Transformer attention exploits comprehensive content and position information from both query and key. The $\mathcal{E}_{1}$ , $\mathcal{E}_{2}$ and $\mathcal{E}_{4}$ terms require computation proportional to the product of the query and key element numbers, because they involve a traversal of each query-key pair. The $\mathcal{E}_{3}$ term captures key content only, and thus involves computation linear to the key element number. In neural machine translation, the key and query elements are commonly dozens of words in a sentence, so the computational overheads of $\mathcal{E}_{1}$ , $\mathcal{E}_{2}$ and $\mathcal{E}_{4}$ are comparable to $\mathcal{E}_{3}$ . In image recognition, the key and query elements consist of numerous pixels in an image. The computational overheads of $\mathcal{E}_{1}$ , $\mathcal{E}_{2}$ and $\mathcal{E}_{4}$ are thus much heavier than $\mathcal{E}_{3}$ . Note that when the four terms are put together, some computational overhead can be shared among them.

Similar to the $\mathcal{E}_{2}$ term, deformable convolution also is based on query content and relative position. But deformable convolution samples just a sparse set of key elements for each query, and the complexity is linear to the query element number. Deformable convolution is thus much faster to compute than $\mathcal{E}_{2}$ for image recognition, and is comparable in speed to $\mathcal{E}_{2}$ for machine translation.

Dynamic convolution also relies on query content and relative position. The attention weights of key elements are assigned by the dynamic convolution kernel, based on the query content. Non-zero attention weights only exist in a local range covered by the dynamic kernel. The computational overhead is proportional to the product of the kernel size and query element number. Compared to the $\mathcal{E}_{2}$ term, the computational overhead can be considerably lower if the kernel size is much smaller than the key element number.

We seek to further disentangle the effects of different attention factors, and to facilitate comparison to other instantiations of spatial attention that use a subset of the factors. Thus, manual switches are introduced into the Transformer attention module, which enable us to manually activate / deactivate particular terms. This is expressed as

where $\{\beta^{\text{Trans}}_{j}\}$ takes values in $\{0,1\}$ to control the activation of corresponding terms, and $\hat{A}^{\text{Trans}}_{m}(q,k,z_{q},x_{k})$ is normalized by $\sum_{k\in\Omega_{q}}\hat{A}^{\text{Trans}}_{m}(q,k,z_{q},x_{k})=1$ .

Incorporating attention modules into deep networks

We incorporate various attention mechanisms into deep networks to study their effects. There are different design choices in inserting the modules, e.g., whether to connect them in series or in parallel, and where to place the modules in the backbone network. We empirically observed the results to be quite similar for different well-considered designs. In this paper, we select the design choices in Fig. 2.

For the object detection and semantic segmentation tasks, ResNet-50 is chosen as the backbone and just the self-attention mechanism is involved. The Transformer attention module is incorporated by applying it on the $3\times 3$ convolution output in the residual block. For insertion into a pre-trained model without breaking the initial behavior, the Transformer attention module includes a residual connection, and its output is multiplied by a learnable scalar initialized to zero, as in . The manner of incorporating dynamic convolution is the same. To exploit deformable convolution, the $3\times 3$ regular convolution in the residual block is replaced by its deformable counterpart. The resulting architecture is called “Attended Residual Block”, shown in Fig. 2 (a).

In the neuron machine translation (NMT) task, the network architecture follows the Transformer base model , where both self-attention and encoder-decoder attention mechanisms are involved. Different from the original paper, we update the absolute position embedding in the Transformer attention module by the latest relative position version as in Eq. 2. Because both deformable convolution and dynamic convolution capture self-attention, they are added to only the blocks capturing self-attention in Transformer. For dynamic convolution, we replace the Transformer attention module by dynamic convolution directly, as in . The architecture is shown in Fig. 2 (b). For its deformable convolution counterpart, because the Transformer model does not utilize any spatial convolution (with kernel size larger than 1), we insert the deformable convolution unit (with kernel size of 3) prior to the input of the Transformer attention module. The resulting architecture is called “Transformer + Deformable”, shown in Fig. 2 (c).

Experiments and Analysis

Models are trained on the 118k images of the COCO 2017 train set. Evaluation is done on the 5k images of the COCO 2017 validation set. Accuracy is measured by the standard mean AP scores at different box IoUs (mAP).

Faster R-CNN with Feature Pyramid Networks (FPN) is chosen as the baseline system. ImageNet pre-trained ResNet-50 is utilized as the backbone. The attended residual blocks in Fig. 2 (a) are applied in the last two stages (conv4 and conv5 stages) of ResNet-50. In Transformer attention, the relative position encoding is of the same dimension as the content feature embedding, specifically 256-d and 512-d in the conv4 and conv5 stages, respectively.

Experiments are implemented based on the open source mmdetection code base. The hyper-parameter setting strictly follows FPN . Anchors of 5 scales and 3 aspect ratios are utilized. 2k and 1k region proposals are generated at a non-maximum suppression threshold of 0.7 at training and inference respectively. In SGD training, 256 anchor boxes (of positive-negative ratio 1:1) and 512 region proposals (of positive-negative ratio 1:3) are sampled for backpropagating their gradients. In our experiments, the networks are trained on 8 GPUs with 2 images per GPU for 12 epochs. The learning rate is initialized to 0.02 and is divided by 10 at the 8-th and the 11-th epochs. The weight decay and the momentum parameters are set to $10^{-4}$ and 0.9, respectively.

Models are trained on the 5,000 finely annotated images of the Cityscapes train set. Evaluation is done on the 500 images of the validation set. The standard mean IoU score (mIoU) is used to measure semantic segmentation accuracy.

The CCNet for semantic segmentation is utilized, with ImageNet pre-trained ResNet-50 and without the criss-cross attention module proposed in , which is a variant of Transformer attention. As done for object detection, the attended residual blocks in Fig. 2 (a) are applied in the last two stages. An additional Transformer attention / dynamic convolution module is placed after the ResNet-50 output following the practice in for improving performance.

The hyper-parameter setting strictly follows that in the CCNet paper . In SGD training, the training images are augmented by randomly scaling (from 0.7 to 2.0), randomly cropping (size of 769 $\times$ 769 pixels) and random flipping horizontally. In our experiments, the networks are trained on 8 GPUs with 1 image per GPU for 60k iterations. The “poly” learning rate policy is employed, where the initial learning rate is set as 0.005 and multiplied by $\big{(}1-\frac{\text{iter}}{\text{iter}_{\text{max}}}\big{)}^{0.9}$ . Synchronized Batch Normalization is placed after every newly added layer with learnable weights. The weight decay and the momentum parameters are set as $10^{-4}$ and 0.9, respectively.

Model training is conducted on the standard WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs. Sentences are encoded using byte-pair encoding , with a shared source-target vocabulary of about 37k tokens. Evaluation is on the English-to-German newstest2014 set. Accuracy is measured by the standard bilingual evaluation understudy (BLEU) scores .

The Transformer base model with relative position encoding is utilized as the backbone. Experiments are implemented based on the open source fairseq code base. The hyper-parameters follows the original setting in . We used the Adam optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.98$ and $\epsilon=10^{-9}$ . In our experiments, the networks are trained on 8 GPUs for 100k iterations. Each training batch contained a set of sentence pairs containing approximately 30k source tokens and 30k target tokens. The initial learning rate is set as $10^{-7}$ and linearly increased to 0.001 after $\text{iter}_{\text{warmup}}=4000$ iterations, and then multiplied by $\frac{\text{iter}}{\text{iter}_{\text{warmup}}}^{-0.5}$ . No weight decay is adopted. During training, label smoothing of value 0.1 is employed.

2 Effects of various attention-based modules

We first seek to disentangle the effects of the four terms in the Transformer attention module. This is achieved by manually setting the $\{\beta^{\text{Trans}}_{j}\}_{j=1}^{4}$ values in Eq. (8) to control the activation / deactivation of individual terms. The network is trained and tested for all 16 possible configurations of $\{\beta^{\text{Trans}}_{j}\}_{j=1}^{4}$ . In this set of experiments, no other attention mechanisms are involved. Thus, for the object detection and semantic segmentation tasks, the $3\times 3$ convolution is of regular convolution in the network of Fig. 2 (a). For the NMT task, the network architecture in Fig. 2 (b) is utilized. Transformer attention is used in the choices of “Transformer attention / Dynamic convolution” in Fig. 2 (a) and (b). Note that for the NMT task, Transformer attention modules are utilized for both self-attention and encoder-decoder attention. To reduce experimental complexity, the Transformer attention modules in encoder-decoder attention are kept as their full version ( $\beta^{\text{Trans}}_{j}=1,j=1,\ldots,4$ , abbreviated as configuration “1111” here) when we study self-attention.

Fig. 3 plots the accuracy-efficiency tradeoffs of different $\{\beta^{\text{Trans}}_{j}\}_{j=1}^{4}$ configurations, where the accuracy-efficiency envelopes are indicated by connected line segments. Note that only the computational overheads from the Transformer attention modules under study are counted here, without the overheads from other parts of the network. From the plot, we draw the following conclusions:

(1) In self-attention, the query-sensitive terms play a minor role compared to the query-irrelevant terms. Especially, the query and key content term have a negligible effect on accuracy, while being computationally heavy in image recognition tasks. Overall, the accuracy gain brought by the Transformer attention module is large (from the configuration where the Transformer attention module is removed (“w/o”) to that where the full version of Transformer attention is utilized (“1111”)). It can be seen that the gain brought by the query-irrelevant terms (from configuration “w/o” to “0011”) is much larger than that brought by the query-sensitive terms (from configuration “0011” to “1111”). Particularly, the performance gain brought by the query and key content term (controlled by $\beta^{\text{Trans}}_{1}$ ) is negligible. Removing it (from configuration “1111” to “0111”) incurs only a tiny drop in accuracy, while considerably reducing the computational overhead in image recognition tasks.

(2) In encoder-decoder attention, the query and key content term is vital. Deactivation of it (controlled by $\beta^{\text{Trans}}_{1}$ ) incurs a noticeable drop in accuracy, while only utilizing the query and key content term (configuration “1000”) delivers accuracy almost the same as the full version (configuration “1111”). This is because the key step in NMT is to align the words in the source and the target sentences. A traversal of the query and key content is essential for such alignment.

(3) In self-attention, the attention factors of query content & relative position and the key content only are most important. The corresponding configuration “0110” delivers accuracy very close to the full version (configuration “1111”), while saving a considerable amount of computational overhead in image recognition tasks. It is also worth noting that the key content only term, which captures saliency information, can effectively improve the performance with little additional overhead.

Our findings contradict the widespread belief that query-sensitive terms, especially the query and key content term, are crucial for the success of Transformer attention. The experimental results suggest that this is only true for the encoder-decoder attention scenario. In self-attention scenarios, the query and key content term is even removable.

Deformable convolution vs. $\mathcal{E}_{2}$ in Transformer attention

Here, we compare deformable convolution and the $\mathcal{E}_{2}$ term from Transformer attention in Eq. (2). Because deformable convolution is designed for capturing self-attention, we restrict the experiments to self-attention scenarios only. Note that when deformable convolution is utilized in the NMT task, the network architecture is of “Transformer + Deformable” in Fig. 2 (c).

Tab. 2 compares deformable convolution and the $\mathcal{E}_{2}$ term in a variety of settings. We find that:

(1) For object detection and semantic segmentation, deformable convolution considerably surpasses the $\mathcal{E}_{2}$ term in both accuracy and efficiency. While for NMT, deformable convolution is on par with the $\mathcal{E}_{2}$ term in both accuracy and efficiency. In terms of efficiency, deformable convolution does not need to traverse all the key elements. This advantage is obvious on images, where numerous pixels are involved. In terms of accuracy, the bilinear sampling in deformable convolution is based on the hypothesis of local linearity of feature maps. This hypothesis holds better on images where local image content changes gradually, than on languages where words change abruptly.

(2) The combination of deformable convolution and the key content only term (“0010 + deformable”) delivers the best accuracy-efficiency tradeoff. The accuracy is on par with using deformable convolution and the whole attention module (“1111 + deformable”), while the overhead is slightly higher than that of deformable convolution only (“w/o + deformable”). This finding is in line with finding (3) of “Disentanglement in Transformer attention”. It further suggests the importance of the query content & relative position and key content only factors in self-attention. The configuration “0010 + deformable” is also plotted in Fig. 3.

Dynamic convolution vs. $\mathcal{E}_{2}$ in Transformer attention

We compare these two instantiations in self-attention scenarios. The network architectures are of Fig. 2 (a) for image recognition tasks, and of Fig. 2 (b) for NMT, where either the Transformer attention with $\mathcal{E}_{2}$ only (configuration “0100”) or dynamic convolution is utilized.

Tab. 3 presents the results. We can find that for NMT, dynamic convolution achieves accuracy on par with the $\mathcal{E}_{2}$ term at reduced computational cost. However, dynamic convolution is not effective for object detection and semantic segmentation, delivering considerably lower accuracy. To further study the influence of kernel size in dynamic convolution, we also constrain the spatial range of the $\mathcal{E}_{2}$ term to be the same as that in dynamic convolution. The accuracy drops as the spatial range shrinks for both dynamic convolution and the $\mathcal{E}_{2}$ term. But it is worth noting that the $\mathcal{E}_{2}$ term still surpasses dynamic convolution at the same spatial range in image recognition tasks, with even smaller computational overhead. The inferior accuracy of dynamic convolution in image recognition tasks might be because dynamic convolution is originally designed for NMT, and some design choices may not be suitable for image recognition.

Introduction

Related Work

Study of Spatial Attention Mechanisms

Experiments and Analysis

2 Effects of various attention-based modules

References