$A^2$-Nets: Double Attention Networks

Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng

Introduction

Deep Convolutional Neural Networks (CNNs) have been successfully applied in image and video understanding during the past few years. Many new network topologies have been developed to alleviate optimization difficulties he2016deep ; he2016identity and increase the learning capacities xie2017aggregated ; chen2017dual , which benefit recognition performance for both images girshick2015fast ; chen2016deeplab and videos tran2017closer significantly.

However, CNNs are inherently limited by their convolution operators which are dedicated to capturing local features and relations, e.g. from a $7\times 7$ region, and are inefficient in modeling long-range interdependencies. Though stacking multiple convolution operators can enlarge the receptive field, it also comes with a number of unfavorable issues in practice. First, stacking multiple operators makes the model unnecessarily deep and large, resulting in higher computation and memory cost as well as increased over-fitting risks. Second, features far away from a specific location have to pass through a stack of layers before affecting the location for both forward propagation and backward propagation, increasing the optimization difficulties during the training. Third, the features visible to a distant location are actually “delayed” ones from several layers behind, causing inefficient reasoning. Though some recent works hu2017 ; wang17non can partially alleviate the above issues, they are either non-flexible hu2017 or computationally expensive wang17non .

In this work, we aim to overcome these limitations by introducing a new network component that enables a convolution layer to sense the entire spatio-temporal space from its adjacent layer immediately. (Here by “space” we mean the entire feature maps of an input frame and the complete spatio-temporal features from a video sequence.) The core idea is to first gather key features from the entire space into a compact set and then distribute them to each location adaptively, so that the subsequent convolution layers can sense features from the entire space even without a large receptive field. We develop a generic function for this purpose and implement it with an efficient double attention mechanism. The first attention operation, a second-order attention pooling, selectively gathers key features from the entire space, while the second adopts another attention mechanism to adaptively distribute a subset of key features that are helpful to complement each spatio-temporal location for high-level tasks. We denote our proposed double-attention block as $A^{2}$ -block and its resultant network as $A^{2}$ -Net.

The double-attention block is related to a number of recent works, including the Squeeze-and-Excitation Networks hu2017 , covariance pooling li2017second , the Non-local Neural Networks wang17non and the Transformer architecture of vaswani2017attention . However, compared with these existing works, it enjoys several unique advantages: Its first attention operation implicitly computes second-order statistics of pooled features and can capture complex appearance and motion correlations that cannot be captured by the global average pooling used in SENet hu2017 . Its second attention operation adaptively allocates features from a compact bag, which is more efficient than exhaustively correlating the features from all the locations with every specific location as in wang17non ; vaswani2017attention . Extensive experiments on image and video recognition tasks clearly validate the above advantages of our proposed method.

We summarize our contributions as follows:

We propose a generic formulation for capturing long-range feature interdependencies via universal gathering and distribution functions.

We propose the double attention block for gathering and distributing long-range features, an efficient architecture that captures second-order feature statistics and makes adaptive feature assignment. The block can model long-range interdependencies with a low computational and memory footprint and at the same time boost image/video recognition performance significantly.

We investigate the effect of our proposed $A^{2}$ -Net with extensive ablation studies and demonstrate its superior performance through comparison with state-of-the-art methods on a number of public benchmarks for both image recognition and video action recognition tasks, including ImageNet-1k, Kinetics and UCF-101.

The rest of the paper is organized as follows. We first motivate and present our approach in Section 2, where we also discuss the relation of our approach to recent works. We then evaluate and report results in Section 3 and conclude the paper with Section 4.

Method

Convolutional operators are designed to focus on local neighborhoods and therefore fail to “sense” the entire spatial and/or temporal space, e.g. the entire input frame or one location across multiple frames. A CNN model thus usually employs multiple convolution layers (or recurrent units donahue2015long ; ng2015beyond ) in order to capture global aspects of the input. Meanwhile, self-attentive and correlation operators like second-order pooling have been recently shown to work well in a wide range of tasks vaswani2017attention ; li2017second ; lin2015bilinear . In this section we present a component capable of gathering and distributing global features to each spatial-temporal location of the input, helping subsequent convolution layers sense the entire space immediately and capture complex relations. We first formally describe this desired component by providing a generic formulation and then introduce our double attention block, a highly efficient instantiation of such a component. We finally discuss the relation of our approach to other recent related approaches.

The idea of gathering and distributing information is motivated by the squeeze-and-excitation network (SENet) hu2017 . Eqn. (1), however, presents it in a more general form that leads to some interesting insights and optimizations. In hu2017 , global average pooling is used in the gathering process, while the resulting single global feature is distributed to all locations, ignoring the different needs across locations. To address these shortcomings, we introduce this generic formulation and propose the Double Attention block, where global information is first gathered by second-order attention pooling (instead of first-order average pooling), and the gathered global features are adaptively distributed, conditioned on the need of the current local feature $\mathbf{v}_{i}$ , by a second attention mechanism. In this way, more complex global relations can be captured by a compact set of features, and each location can receive its customized global information that is complementary to the existing local features, facilitating the learning of more complex relations. The proposed component is illustrated in Figure 1 (a). Below, we first describe its architecture in detail and then discuss some instantiations and its connections to other recent related approaches.

A recent work lin2015bilinear used bilinear pooling to capture second-order statistics of features and generate global representations. Compared with the conventional average and max pooling which only compute first-order statistics, bilinear pooling can capture and preserve complex relations better. Concretely, bilinear pooling gives a sum pooling of second-order features from the outer product of all the feature vector pairs $(\mathbf{a}_{i},\mathbf{b}_{i})$ within two input feature maps $A$ and $B$ :
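As an illustrative sketch of this sum of outer products, the dependency-free Python below pools two small feature maps. The layout is an assumption made for the example: $A$ is stored as $m$ channels by $L$ locations and $B$ as $n$ channels by $L$ locations, so the pooled result is an $m\times n$ matrix equal to $AB^{\top}$. The function name `bilinear_pool` and the toy values are hypothetical.

```python
def bilinear_pool(A, B):
    """Sum of outer products a_i b_i^T over all locations i.

    Assumed layout (not from the paper text): A is m x L and B is n x L,
    where rows are channels and columns are locations.
    Returns an m x n matrix, which equals A @ B^T.
    """
    m, n, L = len(A), len(B), len(A[0])
    G = [[0.0] * n for _ in range(m)]
    for i in range(L):                       # loop over locations
        a_i = [A[r][i] for r in range(m)]    # feature vector of A at location i
        b_i = [B[c][i] for c in range(n)]    # feature vector of B at location i
        for r in range(m):
            for c in range(n):
                G[r][c] += a_i[r] * b_i[c]   # accumulate the outer product
    return G


# Toy inputs: m = 2 and n = 2 channels over L = 3 locations.
A = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]
B = [[1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
G = bilinear_pool(A, B)
print(G)
```

Note that the pooled matrix has a size that depends only on the channel counts, not on the number of locations, which is what makes it a compact global representation.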

The Second Attention Step: Feature Distribution

The next step after gathering features from the entire space is to distribute them to each location of the input, such that the subsequent convolution layer can sense the global information even with a small convolutional kernel.
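One way to picture this distribution step is as a soft selection over the gathered features: every location holds a weight vector $\mathbf{v}_{i}$ and receives the correspondingly weighted combination of global features. The sketch below assumes the weights are obtained with a softmax so that they sum to one at each location. That normalization, the matrix layout, and the names `softmax` and `distribute` are assumptions for illustration, not details confirmed by the text above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def distribute(G, scores):
    """Distribute gathered global features to every location.

    G:      m x n matrix whose n columns are the gathered global features.
    scores: n x L matrix of unnormalized selection scores (L locations).
    Returns an m x L matrix whose column i is G @ v_i,
    with v_i = softmax(scores[:, i]) (assumed normalization).
    """
    m, n, L = len(G), len(scores), len(scores[0])
    Z = [[0.0] * L for _ in range(m)]
    for i in range(L):
        v_i = softmax([scores[k][i] for k in range(n)])  # weights sum to 1
        for r in range(m):
            Z[r][i] = sum(G[r][k] * v_i[k] for k in range(n))
    return Z


G = [[1.0, 3.0],
     [2.0, 4.0]]

# Equal scores: every location receives the average of the global features.
uniform_scores = [[0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]]
Z_uniform = distribute(G, uniform_scores)

# Strongly peaked scores: every location selects (almost only) the first feature.
peaked_scores = [[100.0, 100.0, 100.0],
                 [0.0, 0.0, 0.0]]
Z_peaked = distribute(G, peaked_scores)
```

With uniform scores the step degenerates to broadcasting a single averaged feature to all locations, whereas location-dependent scores let each position pick the global features it needs.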

The Double Attention Block

We combine the above two attention steps to form our proposed double-attention block, whose computation graph is given in Figure 2. To formulate the double attention operation, we substitute Eqn. (4) and Eqn. (5) into Eqn. (1) and obtain

There are two different ways to implement the computational graph of Eqn. (6). One is to use the left association as given in Eqn. (6), whose computation graph is shown in Figure 2. The other is to conduct the right association, as formulated below:
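As a minimal sketch of the two associations, assume the block output is a triple matrix product of the form $AB^{\top}V$, with $A$ of size $m\times dhw$ and $B$, $V$ of size $n\times dhw$. This form and these shapes are assumptions chosen for illustration; the matrices below are random placeholders rather than real feature maps, and `matmul` and `transpose` are hypothetical helpers.

```python
import random


def matmul(P, Q):
    """Plain matrix product of P (r x k) and Q (k x c)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


def transpose(P):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


random.seed(0)
m, n, L = 3, 2, 5   # L stands for the number of locations d*h*w


def rand(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]


A, B, V = rand(m, L), rand(n, L), rand(n, L)

# Left association: (A B^T) V. The intermediate result is only m x n.
gathered = matmul(A, transpose(B))
left = matmul(gathered, V)

# Right association: A (B^T V). The intermediate result is L x L.
affinity = matmul(transpose(B), V)
right = matmul(A, affinity)

# Both orders give the same m x L output up to floating-point rounding.
max_diff = max(abs(x - y)
               for row_l, row_r in zip(left, right)
               for x, y in zip(row_l, row_r))
print(max_diff)
```

The only structural difference is the size of the intermediate matrix, $m\times n$ versus $dhw\times dhw$, which is the source of the cost gap between the two orders.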

We note that these two associations are mathematically equivalent and thus produce the same output. However, they have different computational costs and memory consumption. The computational complexity of the second matrix multiplication in the “left association” in Eqn. (6) is $\mathcal{O}(mndhw)$ , while the “right association” in Eqn. (7) has a complexity of $\mathcal{O}(m(dhw)^{2})$ . As for the memory cost (all values are stored as 32-bit floats), storing the output of the first matrix multiplication costs $mn/2^{18}$ MB and $(dhw)^{2}/2^{18}$ MB for the left and right associations respectively. In practice, an input data array $X$ with $32$ frames of size $28\times 28$ and $512$ channels can easily cost more than $2$ GB of memory when adopting the right association, far more than the roughly $1$ MB cost of the left association. In this case, the left association is also more computationally efficient than the right one. Therefore, for common cases where $(dhw)^{2}>nm$ , we suggest the implementation in Eqn. (6) with left association.
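The figures quoted above can be reproduced with a few lines of arithmetic. The sketch below takes $d=32$, $h=w=28$ from the example and assumes $m=n=512$, a choice made here because it matches the quoted $1$ MB figure; the helper name `intermediate_mb` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_MB = 2 ** 20


def intermediate_mb(num_elements):
    """Memory in MB needed to store `num_elements` 32-bit floats."""
    return num_elements * BYTES_PER_FLOAT32 / BYTES_PER_MB


d, h, w = 32, 28, 28      # 32 frames of 28 x 28, as in the example above
m = n = 512               # assumed channel sizes (reproduces the 1 MB figure)
locations = d * h * w     # dhw = 25088

# Memory for the output of the first matrix multiplication.
left_mb = intermediate_mb(m * n)                    # m x n matrix
right_mb = intermediate_mb(locations * locations)   # dhw x dhw matrix

# Multiply-add counts of the second matrix multiplication.
left_ops = m * n * locations            # O(m n dhw)
right_ops = m * locations * locations   # O(m (dhw)^2)

print(left_mb, right_mb, right_ops // left_ops)
```

Under these assumptions the right association stores a $2401$ MB intermediate (about $2.3$ GB) against $1$ MB for the left association, and its second multiplication needs $dhw/n = 49$ times more multiply-adds.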

Discussion

Experiments

In this section, we first conduct extensive ablation studies to evaluate the proposed $A^{2}$ -Nets on the Kinetics kay2017kinetics video recognition dataset and compare them with the state-of-the-art NL-Net wang17non . We then conduct further experiments using deeper and wider neural networks on both image recognition and video recognition tasks and compare them with state-of-the-art methods.

We use the residual network he2016identity as our backbone CNN for all experiments. Table 1 shows architecture details of the backbone CNNs for video recognition tasks, where we use ResNet-26 for all ablation studies and ResNet-29 as one of the baseline methods. The computational cost is measured by FLOPs, i.e. floating-point multiply-adds, and the model complexity is measured by #Params, i.e. the total number of trained parameters. ResNet-50 is almost $2\times$ deeper and wider than ResNet-26 and is thus used only for the last several experiments, when comparing with state-of-the-art methods. For the image recognition task, we use the same ResNet-50 but without the temporal dimension for both the input/output data and convolution kernels.

Training and Testing Settings

We use MXNet chen2015mxnet to experiment on the image classification task, and PyTorch paszke2017pytorch on video classification tasks. For image classification, we report standard single model single $224\times 224$ center crop validation accuracy, following he2016deep ; he2016identity . For experiments on video datasets, we report both single clip accuracy and video accuracy. All experiments are conducted using a distributed K80 GPU cluster and the networks are optimized by synchronized SGD. Code and trained models will be released on GitHub soon.

Ablation Studies

For the ablation studies on Kinetics carreira2017quo , we use 32 GPUs per experiment with a total batch size of 512, training from scratch. All networks take 16 frames with resolution $112\times 112$ as input. The base learning rate is set to $0.2$ and is reduced by a factor of $0.1$ at the $20$ k-th and $30$ k-th iterations, and training is terminated at the $37$ k-th iteration. We set the number of output channels for the three convolution layers $\theta(\cdot)$ , $\phi(\cdot)$ and $\rho(\cdot)$ to be $1/4$ of the number of input channels. Note that the sub-sampling trick is not adopted for any method, for fair comparison.
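The step schedule described above can be written as a small function. This is a sketch under one assumption: the reduction is applied starting exactly at the milestone iteration, a boundary convention the text does not specify. The function name `learning_rate` is hypothetical.

```python
def learning_rate(iteration, base_lr=0.2, milestones=(20_000, 30_000),
                  factor=0.1, max_iter=37_000):
    """Step learning-rate schedule for the ablation setting.

    The rate is multiplied by `factor` once for every milestone already
    reached (assumed convention: the drop applies from the milestone on).
    """
    if not 0 <= iteration < max_iter:
        raise ValueError("training runs for iterations 0 .. max_iter - 1")
    drops = sum(1 for milestone in milestones if iteration >= milestone)
    return base_lr * factor ** drops


# Per-GPU batch size implied by a total batch of 512 on 32 GPUs.
per_gpu_batch = 512 // 32

for it in (0, 19_999, 20_000, 30_000, 36_999):
    print(it, learning_rate(it))
```

Under this reading the rate is approximately $0.2$, $0.02$ and $0.002$ in the three phases, and each GPU processes 16 clips per iteration.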

Table 2 shows the results when only one extra block is added to the backbone network. The block is placed after the second residual unit of a certain stage. As can be seen from the last three rows, our proposed $A^{2}$ -block consistently improves the performance compared with both the baseline ResNet-26 and the deeper ResNet-29. Notably, the extra cost is very small. We also find that the performance gain from placing the $A^{2}$ -block on top layers is more significant than from placing it at lower layers. This may be because the top layers give more semantically abstract representations that are suitable for extracting global visual primitives. In comparison, the Nonlocal Network wang17non shows a smaller accuracy gain and a higher computational cost than ours. Since the computational cost of the Nonlocal Network grows quadratically at the bottom stages, we were unable to finish the training when the block was placed at Conv2.

Multiple Blocks

Table 3 shows the performance gain when multiple blocks are added to the backbone networks. As can be seen from the results, our proposed $A^{2}$ -Net monotonically improves the accuracy as more blocks are added and requires fewer FLOPs than its competitor. We also find that adding blocks to different stages leads to a more significant accuracy gain than adding all blocks to the same stage.

Experiments on Image Recognition

We evaluate the proposed $A^{2}$ -Net on the ImageNet-1k krizhevsky2012imagenet image classification dataset, which contains more than 1.2 million high resolution images in $1,000$ categories. Our implementation is based on the code released by chen2017dual , using $64$ GPUs with a batch size of $2,048$ . The base learning rate is set to $\sqrt{0.1}$ and is decreased by a factor of $0.1$ when the training accuracy saturates.

As can be seen from Table 5, a ResNet-50 equipped with 5 extra $A^{2}$ -blocks at Conv3 and Conv4 outperforms a much larger ResNet-152 architecture. We note that the ResNet-50 with embedded $A^{2}$ -blocks is also over 40% more efficient than ResNet-152 and only costs $6.5$ GFLOPs and $33.0$ M parameters. Compared with the SENet hu2017 , the $A^{2}$ -Net also achieves better accuracy, which supports the effectiveness of the proposed double attention mechanism.

Experiment Results on Video Recognition

In this subsection, we evaluate the proposed method on learning video representations. We consider the scenario where static image features are pretrained but motion features are learned from scratch by training a model on the large-scale Kinetics carreira2017quo dataset, and the scenario where well-trained motion features are transferred to the small-scale UCF-101 soomro2012ucf101 dataset.

We use ResNet-50 pretrained on ImageNet and add 5 randomly initialized $A^{2}$ -blocks to build the 3D convolutional network. The corresponding backbone is shown in Table 1. The network takes 8 frames (sampling stride: 8) as input and is trained for $32k$ iterations with a total batch size of $512$ using $64$ GPUs. The initial learning rate is set to $0.04$ and decreased in a stepwise manner when the training accuracy saturates. The final result is shown in Table 5. Compared with the state-of-the-art I3D carreira2017quo and R(2+1)D tran2017closer , our proposed model shows higher accuracy even with fewer sampled frames, which once again supports the advantage of the proposed double-attention mechanism.
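To make the input setting concrete, the sketch below lists the frame indices of one clip, reading "sampling stride: 8" as taking every 8th frame of the source video. That reading, the start offset, and the function name `sample_frame_indices` are assumptions for illustration.

```python
def sample_frame_indices(start, num_frames=8, stride=8):
    """Indices of the frames in one clip, taking every `stride`-th frame."""
    return [start + k * stride for k in range(num_frames)]


indices = sample_frame_indices(start=0)
# Number of source frames covered by the clip, from first to last sample.
span = indices[-1] - indices[0] + 1
print(indices, span)
```

Under this interpretation, an 8-frame clip covers 57 consecutive source frames, so a small number of input frames can still span a comparatively long temporal window.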

Transfer the Learned Feature to UCF-101

UCF-101 contains about $13,320$ videos from 101 action categories and has three train/test splits. The training set of UCF-101 is several times smaller than the Kinetics dataset, and we use it to evaluate the generality and robustness of the features learned by our model pre-trained on Kinetics. The network is trained with a base learning rate of $0.01$ , which is decreased three times by a factor of $0.1$ , using 8 GPUs with a batch size of 104 clips, and is tested with $224\times 224$ input resolution at a single scale. Table 6 shows the results of our proposed model and a comparison with state-of-the-art methods. Consistent with the above results, the $A^{2}$ -Net achieves leading performance with significantly lower computational cost. This shows that the features learned by $A^{2}$ -Net are robust and can be effectively transferred to a new dataset at very low cost compared with existing methods.

Conclusions

In this work, we proposed a double attention mechanism for deep CNNs to overcome the limitation of local convolution operations. The proposed double attention method effectively captures the global information and distributes it to every location in a two-step attention manner. We formulated the proposed method and instantiated it as a lightweight block that can be easily inserted into existing CNNs with little computational overhead. Extensive ablation studies and experiments on a number of benchmark datasets, including ImageNet-1k, Kinetics and UCF-101, confirmed the effectiveness of the proposed $A^{2}$ -Net on both 2D image recognition tasks and 3D video recognition tasks. In the future, we want to explore integrating the double attention into recent compact network architectures sandler2018inverted ; ma2018shufflenet ; chen2018multifiber , to leverage the expressiveness of the proposed method for smaller, mobile-friendly models.