Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering

Gao Peng, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven Hoi, Xiaogang Wang, Hongsheng Li

Introduction

Visual Question Answering (VQA) aims at automatically answering a natural language question about the contents of a given image. It has extensive practical applications, such as assisting blind people and educating young children, and has therefore become a hot research topic. VQA performance has improved substantially in recent years thanks to three lines of work. Firstly, better visual and language feature representations are at the core of boosting VQA performance. The feature learning capability of networks from VGG, ResNet and FishNet to the recent bottom-up & top-down features has increased VQA performance significantly. Secondly, different variants of attention mechanisms can adaptively select important features, which helps deep networks achieve better recognition accuracy. Thirdly, better multi-modality fusion approaches, such as Bilinear Fusion, MCB and MUTAN, have been proposed to better capture the high-level interactions between language and visual features.

Despite being studied extensively, most existing VQA approaches focus on learning inter-modality relations between visual and language features. Bilinear feature fusion approaches capture the higher-order relations between the language and visual modalities through a feature outer product. Co-attention or bilinear attention-based approaches learn the inter-modality relations between word-region pairs to identify the key pairs for question answering. On the other hand, there exist computer vision and natural language processing algorithms that focus on learning intra-modality relations. Hu et al. proposed to explore intra-modality object-to-object relations to boost object detection accuracy. Yao et al. modelled intra-modality object-to-object relations to improve image captioning performance. In the recently proposed BERT algorithm for natural language processing, intra-modality word relations are modelled by a self-attention mechanism to learn state-of-the-art word embeddings. However, inter- and intra-modality relations have never been jointly investigated in a unified framework for solving the VQA problem. We argue that, for the VQA problem, the intra-modality relations within each modality are complementary to the inter-modality relations, yet they have mostly been ignored by existing VQA methods. For instance, for the image modality, each image region should obtain information not only from its associated words or phrases in the question but also from related image regions to infer the answer to the question. For the question modality, a better understanding of the question can be acquired by inferring from the other words in the question. Such cases motivate us to propose a unified framework for modelling both inter- and intra-modality information flow.

To overcome these limitations, we propose a novel Dynamic Fusion with Intra- and Inter-modality Attention Flow (DFAF) framework for efficient multi-modality feature fusion to accurately answer visual questions. The overall diagram is shown in Figure 1. Our DFAF framework integrates cross-modal self-attention and cross-modal co-attention mechanisms to achieve effective information flows within and between the image and language modalities. Given visual and question features encoded by deep neural networks, the DFAF framework first generates inter-modality attention flow (InterMAF) to pass information between image and language. In the InterMAF module, visual and language features generate a joint-modality co-attention matrix. Each visual region selects question features according to the joint-modality co-attention matrix, and vice versa. The InterMAF module fuses and updates the features of each image region and each word according to the attention-weighted information flows from the other modality. Following the InterMAF module, DFAF calculates the dynamic intra-modality attention flow (DyIntraMAF), which passes information within each modality to capture the complex intra-modality relations. Visual regions and sentence words generate self-attention weights and aggregate attention-weighted information from other instances in the same modality. More importantly, although the information is only propagated within the same modality, information from the other modality is used to modulate the intra-modality attention weights and flows. With such an operation, the attention flows within each modality are dynamically conditioned on the other modality, which is the key difference from existing intra-modality message passing methods for object detection and image captioning. DyIntraMAF is shown to be substantially better than its variant that uses only internal information for intra-modality information flow, and it is key to the success of the proposed framework. We alternate InterMAF and DyIntraMAF modules to create the basic blocks of DFAF. Stacking multiple DFAF blocks is shown to further improve VQA performance.

Our contributions are threefold. (1) A novel Dynamic Fusion with Intra- and Inter-modality Attention Flow (DFAF) framework is proposed for multi-modality fusion by interleaving intra- and inter-modality feature fusion. This framework is the first to integrate inter-modality and dynamic intra-modality information flow in a unified framework for tackling the VQA task. (2) A Dynamic Intra-modality Attention Flow (DyIntraMAF) module is proposed for generating effective attention flows within each modality, which are dynamically conditioned on the information of the other modality. It is one of the core novelties of our proposed framework. (3) Extensive experiments and ablation studies are performed to examine the effectiveness of the proposed DFAF framework, which achieves state-of-the-art VQA performance.

Related Work

Representation learning for VQA. The recent boost in VQA performance is due to the success of deep representation learning. In early VQA methods, the VGG network was commonly used. With the introduction of ResNet, the VQA community shifted to ResNet networks, which outperform VGG by large margins. Recently, the bottom-up and top-down network derived from Faster R-CNN has been shown to be suitable for VQA and image captioning tasks. Feature learning is an essential component in the development of VQA algorithms.

Bilinear Fusion for VQA. Solving VQA requires understanding visual and language contents as well as the relations between them. In early VQA methods, simple concatenation or element-wise multiplication of visual and language features was used for cross-modal feature fusion. To capture the high-level interactions between the two modalities, Bilinear Fusion adopts bilinear pooling to fuse features from the two modalities. To overcome the high computational cost of bilinear pooling, many approximate fusion methods, including MCB, MLB and MUTAN, were proposed, which have shown better performance than bilinear fusion with far fewer parameters.

Self-attention-based methods. The attention mechanism in deep learning tries to mimic how human vision works. By automatically ignoring irrelevant information in the data, neural networks can selectively focus on important features. This approach has achieved great success in Natural Language Processing (NLP), image captioning and VQA. There are many variants of the attention mechanism. Our approach is mainly motivated by self-attention and co-attention based methods. The self-attention mechanism transforms features into query, key and value features. The attention matrix between different features is then calculated by the inner product of query and key features. After acquiring the attention matrix, features are aggregated as the attention-weighted summation of the value features. The self-attention mechanism has significantly improved performance on many vision tasks. Non-local neural networks proposed a non-local module for aggregating information between different frames within one video and achieved state-of-the-art performance in video classification. Relation Network learns the relationships between object proposals by adopting the self-attention mechanism. Its in-place module can boost the performance of Faster R-CNN and Non-Maximum Suppression (NMS).

Co-attention-based methods. Co-attention based vision and language methods model the interactions across the two modalities. For each word, the features of every image region are aggregated to the word according to the co-attention weights. The co-attention mechanism has been widely used in NLP and VQA tasks. Dense Symmetric Co-attention (DCN) has also been proposed. It achieved state-of-the-art performance on the VQAv1 and VQAv2 datasets without using any bottom-up and top-down features. The success of DCN is due to the dense concatenation of symmetric co-attention.

Other works for language and vision tasks. Beyond the above-mentioned methods, many algorithms have also been proposed for fusing cross-modal language and visual features. Dynamic Parameter Prediction and Question-guided Hybrid Convolution utilized dynamically predicted parameters for feature fusion. Adaptive attention introduced a visual sentinel that can skip attention during image captioning. Structured attention adopted an MRF model over attention maps for better modelling of spatial attention distributions. Locally weighted deformable neighbours proposed to predict offsets and modulation weights.

Dynamic Fusion with Intra- and Inter-modality Attention Flow for VQA

The proposed approach consists of a series of DFAF modules. The whole pipeline is illustrated in Figure 1. Visual and language features are first weighted with the co-attention mechanism and aggregated across the two modalities to each image region and each word by the proposed Inter-modality Attention Flow (InterMAF) module, which learns the cross-modal interactions between the image regions and question words. Following the inter-modality module, the Dynamic Intra-modality Attention Flow (DyIntraMAF) module is adopted to model the relationships within each modality, i.e., word-to-word relations and region-to-region relations. It weights words and regions within each modality and aggregates their features to the words and regions again, which can be viewed as passing information flows within each modality. Importantly, in our proposed intra-modality module, the attention flows are dynamically conditioned on the information from the other modality, which is a key difference from existing self-attention based methods. The InterMAF and DyIntraMAF modules can be stacked multiple times to pass the information flows among words and regions iteratively and to model the latent alignments for visual question answering.

Base visual and language feature extraction

The obtained visual object region features $R$ and question features $E$ could be denoted as
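A form consistent with the parameter names used in the following sentence is given below. The symbols $I$ and $Q$ for the input image and question are assumed here for illustration.

```latex
R = \mathrm{RCNN}(I;\ \theta_{\text{RCNN}}), \qquad
E = \mathrm{GRU}(Q;\ \theta_{\text{GRU}})
```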

where the visual feature parameters $\theta_{\text{RCNN}}$ are fixed, while the question feature parameters $\theta_{\text{GRU}}$ are learned from scratch and updated jointly when training our proposed framework.

Inter-modality Attention Flow

The Inter-modality Attention Flow (InterMAF) module, as shown in Figure 1, first learns to capture the importance between each pair of visual region and word features. It then passes information flows between the two modalities according to the learned importance weights and aggregates features to update each word feature and image region feature. Such an information flow process is able to identify cross-modal relations between visual regions and words.
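Each modality is first projected into query, key and value features. One way to write these projections, using the feature names $R_{Q},R_{K},R_{V},E_{Q},E_{K},E_{V}$ that appear later in the text, is sketched below. The per-projection parameter subscripts such as $\theta_{RQ}$ are illustrative names assumed here.

```latex
\begin{aligned}
R_K &= \mathrm{Linear}(R;\ \theta_{RK}), &
R_Q &= \mathrm{Linear}(R;\ \theta_{RQ}), &
R_V &= \mathrm{Linear}(R;\ \theta_{RV}), \\
E_K &= \mathrm{Linear}(E;\ \theta_{EK}), &
E_Q &= \mathrm{Linear}(E;\ \theta_{EQ}), &
E_V &= \mathrm{Linear}(E;\ \theta_{EV}),
\end{aligned}
```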

where “Linear” denotes a fully-connected layer with parameter $\theta$, and $dim$ represents the common dimension of transformed features from both modalities.

The magnitude of the inner product values grows with the dimension of the hidden feature space, so they need to be normalized by the square root of the hidden dimension. The softmax non-linearity is applied row-wise.

where $E_{V}$ and $R_{V}$ are the un-weighted information flows (value features) to update visual region features and word features in Eq. (5), and the two InterMAF matrices are used to weight such information flows.
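The InterMAF computation described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names, the random stand-in features and the toy sizes are all assumptions, and the query, key and value features are taken as given rather than produced by learned layers.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def inter_modality_flow(R_Q, R_K, R_V, E_Q, E_K, E_V):
    """Cross-modal attention between regions (R_*) and words (E_*).

    Shapes: R_* is (num_regions, dim), E_* is (num_words, dim).
    Returns the attention-weighted updates for regions and for words.
    """
    dim = R_Q.shape[-1]
    # Each region queries all words; every row sums to 1 over the words.
    attn_regions_from_words = softmax_rows(R_Q @ E_K.T / np.sqrt(dim))
    # Each word queries all regions; every row sums to 1 over the regions.
    attn_words_from_regions = softmax_rows(E_Q @ R_K.T / np.sqrt(dim))
    R_update = attn_regions_from_words @ E_V   # (num_regions, dim)
    E_update = attn_words_from_regions @ R_V   # (num_words, dim)
    return R_update, E_update

# Toy example with random stand-ins for the projected features.
rng = np.random.default_rng(0)
num_regions, num_words, dim = 5, 3, 8
R_Q, R_K, R_V = (rng.standard_normal((num_regions, dim)) for _ in range(3))
E_Q, E_K, E_V = (rng.standard_normal((num_words, dim)) for _ in range(3))
R_update, E_update = inter_modality_flow(R_Q, R_K, R_V, E_Q, E_K, E_V)
print(R_update.shape, E_update.shape)
```

Each region update is a convex combination of word value features and each word update is a convex combination of region value features, which is what makes the flow cross-modal.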

After acquiring the updated visual and word features, we concatenate them with the original visual features $R$ and word features $E$. A fully connected layer is utilized to transform the concatenated features into output features,
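Writing $R_{\text{update}}$ and $E_{\text{update}}$ for the attention-weighted features, $[\cdot,\cdot]$ for concatenation, and $\theta_{R}$, $\theta_{E}$ for the layer parameters (all of this notation is assumed here), the step would take a form such as:

```latex
R \leftarrow \mathrm{Linear}([R,\ R_{\text{update}}];\ \theta_{R}), \qquad
E \leftarrow \mathrm{Linear}([E,\ E_{\text{update}}];\ \theta_{E})
```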

The output features by the InterMAF module would then be fed into the following Dynamic Intra-modality Attention Flow module for learning intra-modality information flows to further update the visual region and word features for capturing region-to-region and word-to-word relations.

Dynamic Intra-modality Attention Flow

The input visual region and word features of DyIntraMAF have encoded cross-modal relations between visual regions and words. However, we argue that relationships within each modality are complementary to the cross-modal relations and should be taken into account to improve VQA accuracy. For example, for the question “who is above the skateboard?”, the intra-modality module should relate the region above the skateboard to the skateboard region to infer the final answer. Therefore, we propose the Dynamic Intra-modality Attention Flow (DyIntraMAF) module for modelling such within-modality relations with a dynamic attention mechanism. The implementation of DyIntraMAF is illustrated in Figure 2.

The naive intra-modality matrices to capture the importance between regions and between words could be defined similarly to Eq. (5) as,
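Following the scaled dot-product form described for InterMAF, the naive matrices can be written as below. The arrow subscripts are notation assumed here.

```latex
\mathrm{IntraMAF}_{R\leftarrow R} = \mathrm{softmax}\!\left(\frac{R_Q R_K^{\top}}{\sqrt{dim}}\right), \qquad
\mathrm{IntraMAF}_{E\leftarrow E} = \mathrm{softmax}\!\left(\frac{E_Q E_K^{\top}}{\sqrt{dim}}\right)
```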

The dot products between the query and key features of the same modality estimate the within-modality importance. Such weight matrices can then be used to weight the information flows transmitted within each modality. Modelling within-modality relationships has been shown to be effective in object detection, image captioning and BERT word embedding pretraining.

However, the naive IntraMAF module only utilizes within-modality information for estimating the region-to-region and word-to-word importance. Some relations are important but could only be identified conditioned on information from the other modality. For instance, even for the same input image, relations between different visual region pairs should be weighted differently according to different questions. We therefore propose a Dynamic Intra-modality Attention Flow (DyIntraMAF) module to estimate within-modality relation importance conditioned on the information from the other modality.

To summarize the conditioning information from the other modality, we average pool the visual region features along the object-index dimension and the word features along the word-index dimension. The average pooled features of both modalities are then transformed to a $dim$-dimensional feature vector to match the dimension of the query and key features $R_{Q},R_{K},E_{Q},E_{K}$. The $dim$-dimensional feature vector of each modality is then processed by a sigmoid non-linearity function $\sigma(\cdot)$ to generate channel-wise conditioning gates for the other modality,
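A sketch of the gate computation in symbols is given below. The gate names $G_{R\leftarrow E}$ (computed from the words, applied to the regions) and $G_{E\leftarrow R}$ (computed from the regions, applied to the words), the operator name $\mathrm{AvgPool}$ and the parameter subscripts are all assumed for illustration.

```latex
G_{R\leftarrow E} = \sigma\big(\mathrm{Linear}(\mathrm{AvgPool}(E);\ \theta_{R\leftarrow E})\big), \qquad
G_{E\leftarrow R} = \sigma\big(\mathrm{Linear}(\mathrm{AvgPool}(R);\ \theta_{E\leftarrow R})\big)
```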

The query and key features from both modalities are then modulated by the conditional gates from the other modality
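One gating form consistent with this description is sketched below, with $G_{R\leftarrow E}$ denoting the gate computed from the words and applied to the regions, $G_{E\leftarrow R}$ the gate computed from the regions and applied to the words, and a hat marking the gated features. The additive $1$, which keeps the original features and lets the gate rescale them, is an assumption of this sketch, as is the notation.

```latex
\begin{aligned}
\hat{R}_Q &= (1 + G_{R\leftarrow E}) \odot R_Q, &
\hat{R}_K &= (1 + G_{R\leftarrow E}) \odot R_K, \\
\hat{E}_Q &= (1 + G_{E\leftarrow R}) \odot E_Q, &
\hat{E}_K &= (1 + G_{E\leftarrow R}) \odot E_K,
\end{aligned}
```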

where $\odot$ denotes element-wise multiplication. Channels of the query and key features are activated or deactivated by channel-wise gates conditioned on the other modality. The design of the two gating vectors is similar in spirit to Squeeze-and-Excitation Networks and Gated Convolution. The key difference is that the channel-wise gating vector is created based on cross-modal information.

Visual region and word features are then updated by the weighted value features $R_{V}$ and $E_{V}$ via a residual connection,
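With $\hat{R}_Q,\hat{R}_K,\hat{E}_Q,\hat{E}_K$ denoting the gated query and key features, the dynamic attention matrices and the residual update could be written as follows. The matrix names and hat notation are assumed here for illustration.

```latex
\begin{aligned}
\mathrm{DyIntraMAF}_{R\leftarrow R} &= \mathrm{softmax}\!\left(\frac{\hat{R}_Q \hat{R}_K^{\top}}{\sqrt{dim}}\right), &
R &\leftarrow R + \mathrm{DyIntraMAF}_{R\leftarrow R}\, R_V, \\
\mathrm{DyIntraMAF}_{E\leftarrow E} &= \mathrm{softmax}\!\left(\frac{\hat{E}_Q \hat{E}_K^{\top}}{\sqrt{dim}}\right), &
E &\leftarrow E + \mathrm{DyIntraMAF}_{E\leftarrow E}\, E_V.
\end{aligned}
```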

Note that here we only make the key and query features conditioned on the other modality to adaptively weight within-modality information flows. In our ablation studies, we observe that the proposed DyIntraMAF module outperforms the naive IntraMAF module by a large margin.
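The effect of the conditioning can be illustrated with a small NumPy sketch for one modality. It is a hedged illustration under stated assumptions, not the authors' code: the function and parameter names (`dynamic_intra_flow`, `W_gate`, `b_gate`) are hypothetical, the projected features are random stand-ins, and the `(1 + gate)` modulation is one possible choice of gating form.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_intra_flow(X, X_Q, X_K, X_V, other, W_gate, b_gate):
    """Self-attention within one modality, gated by the other modality.

    X              : (n, dim) features of this modality (regions or words)
    X_Q, X_K, X_V  : (n, dim) query, key and value projections of X
    other          : (m, dim_other) features of the other modality
    W_gate, b_gate : hypothetical parameters of the gate's linear layer
    """
    dim = X_Q.shape[-1]
    # Average-pool the other modality, project, and squash to (0, 1).
    gate = sigmoid(other.mean(axis=0) @ W_gate + b_gate)   # (dim,)
    # ASSUMPTION: (1 + gate) modulation; only queries and keys are gated.
    Q_hat = (1.0 + gate) * X_Q
    K_hat = (1.0 + gate) * X_K
    attn = softmax_rows(Q_hat @ K_hat.T / np.sqrt(dim))     # (n, n)
    return X + attn @ X_V, attn                             # residual update

# Same regions, two different "questions": the attention weights change.
rng = np.random.default_rng(0)
n_regions, n_words, dim = 4, 3, 8
R = rng.standard_normal((n_regions, dim))
R_Q, R_K, R_V = (rng.standard_normal((n_regions, dim)) for _ in range(3))
W_gate, b_gate = rng.standard_normal((dim, dim)), np.zeros(dim)
question_a = rng.standard_normal((n_words, dim))
question_b = rng.standard_normal((n_words, dim))
R_new_a, attn_a = dynamic_intra_flow(R, R_Q, R_K, R_V, question_a, W_gate, b_gate)
R_new_b, attn_b = dynamic_intra_flow(R, R_Q, R_K, R_V, question_b, W_gate, b_gate)
print(np.abs(attn_a - attn_b).max())
```

Because the gate depends only on the pooled features of the other modality, the region-to-region weights differ between the two questions even though the region features are identical, which is the behaviour the naive IntraMAF module cannot produce.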

The Framework with Intra- and Inter-modality Attention Flow

In this section, we introduce how to integrate the intra- and inter-modality attention flow modules into our proposed framework. The whole pipeline is illustrated in Figure 1. The proposed framework first extracts visual region features and word features from the input image and question using the Faster R-CNN and GRU models, respectively. The Faster R-CNN weights are fixed while training our proposed framework, whereas the GRU weights are learned from scratch together with our framework.

The visual region features and word features are first transformed into vectors of the same dimension by fully connected layers. The InterMAF module then passes information flows between each pair of visual region and question word and aggregates updated features to each region and each word. Such aggregated features integrate information from the other modality to update the visual and word features according to the cross-modal relations.

Given the InterMAF outputs, the DyIntraMAF module is utilized for dynamically passing information flows within each modality. The visual region and word features would be updated again with information within the same modality via residual connections.

We use one InterMAF module followed by one DyIntraMAF module to form a basic block in our proposed DFAF framework. Multiple blocks could be stacked thanks to the feature concatenation and residual connection in the feature updating procedures. Very deep intra- and inter-modality information flows can be effectively trained with stochastic gradient descent. In addition, we utilize multi-head attention in practice. The original features are split along channel dimensions into groups and different groups would generate parallel attentions to update visual and word features in different groups independently.
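The channel-group splitting used for multi-head attention can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names; the 8 heads of 64 dimensions follow the experimental setup, while the region count of 10 is an arbitrary stand-in.

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (n, dim) features into (num_heads, n, dim // num_heads)."""
    n, dim = x.shape
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    return x.reshape(n, num_heads, dim // num_heads).transpose(1, 0, 2)

def merge_heads(x):
    """Inverse of split_heads: (num_heads, n, head_dim) -> (n, dim)."""
    num_heads, n, head_dim = x.shape
    return x.transpose(1, 0, 2).reshape(n, num_heads * head_dim)

def multi_head_attention(Q, K, V, num_heads):
    """Scaled dot-product attention computed independently per channel group."""
    Qh, Kh, Vh = (split_heads(t, num_heads) for t in (Q, K, V))
    head_dim = Qh.shape[-1]
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n_q, n_k)
    scores = scores - scores.max(axis=-1, keepdims=True)     # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return merge_heads(weights @ Vh)                          # (n_q, dim)

rng = np.random.default_rng(0)
Q = rng.standard_normal((14, 512))   # 14 word queries
K = rng.standard_normal((10, 512))   # 10 region keys (arbitrary count)
V = rng.standard_normal((10, 512))
out = multi_head_attention(Q, K, V, num_heads=8)
print(out.shape)
```

Each head attends over its own 64-channel slice, so the heads can focus on different region-word relations while the total feature dimension stays at 512.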

Answer Prediction Layer and Loss Function

After several blocks of feature updating by InterMAF and DyIntraMAF modules, we obtain the final visual region and word features encoding inter-modality and intra-modality relations for VQA. By average pooling over region features and over word features, we obtain discriminative representations for the image and the question, respectively. Such features can then be fused via feature concatenation, element-wise product, or feature addition. We experiment with the three fusion approaches, among which the element-wise product between visual and language representations achieves the best performance by a small margin.

Similar to state-of-the-art VQA approaches, we treat VQA as a classification problem. The fused multi-modal features are transformed into a probability vector by a 2-layer multi-layer perceptron with a ReLU non-linearity between the layers and a final softmax function. The ground-truth answers are drawn from the annotated answers that appear more than 5 times. The cross-entropy loss is adopted as the objective function.
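The pooling, fusion and classification steps of this section can be sketched together in NumPy. The function names, layer sizes and random weights below are illustrative assumptions and do not reflect the trained model.

```python
import numpy as np

def pool_and_fuse(R, E, mode="product"):
    """Average-pool regions and words, then fuse the two pooled vectors."""
    r, e = R.mean(axis=0), E.mean(axis=0)      # each of shape (dim,)
    if mode == "product":
        return r * e
    if mode == "sum":
        return r + e
    if mode == "concat":
        return np.concatenate([r, e])
    raise ValueError(f"unknown fusion mode: {mode}")

def answer_probabilities(fused, W1, b1, W2, b2):
    """2-layer MLP with ReLU in between, followed by a softmax."""
    hidden = np.maximum(fused @ W1 + b1, 0.0)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())        # stable softmax
    return exp / exp.sum()

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the ground-truth answer class."""
    return -np.log(probs[target_index])

rng = np.random.default_rng(0)
dim, hidden_dim, num_answers = 16, 32, 10      # illustrative sizes only
R = rng.standard_normal((5, dim))              # 5 region features
E = rng.standard_normal((3, dim))              # 3 word features
W1, b1 = rng.standard_normal((dim, hidden_dim)) * 0.1, np.zeros(hidden_dim)
W2, b2 = rng.standard_normal((hidden_dim, num_answers)) * 0.1, np.zeros(num_answers)
probs = answer_probabilities(pool_and_fuse(R, E), W1, b1, W2, b2)
loss = cross_entropy(probs, target_index=4)
print(probs.shape, float(loss))
```

Note that concatenation doubles the input width of the first MLP layer, whereas the product and sum variants keep it at `dim`.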

Experiments

We use VQA version 2.0 for our experiments. The VQA dataset contains human-annotated question-answer pairs for images from the Microsoft COCO dataset. VQA 2.0 is an updated version of VQA 1.0 with many more annotations and less dataset bias. VQA 2.0 is split into train, validation and test-standard sets, and 25% of the test-standard set serves as the test-dev set. All question types are divided into Yes/No, Number and Other categories. The train, validation and test-standard sets contain 82,783, 40,504 and 81,434 images, with 443,757, 214,354 and 447,793 questions, respectively. Each question has 10 answers from different annotators. Answers with the highest frequency are treated as the ground truth. Following previous approaches, we perform ablation studies on the validation set and train on the combined train and validation splits for testing.
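As a small illustration of how ground-truth labels and the answer vocabulary could be derived from the annotations, consider the sketch below. The helper names are hypothetical, the tie-breaking rule is an assumption, and the toy answers are made up for the example.

```python
from collections import Counter

def most_frequent_answer(annotations):
    """Return the answer given by the most annotators.

    ASSUMPTION: ties are broken in favour of the answer seen first.
    """
    return Counter(annotations).most_common(1)[0][0]

def build_answer_vocab(ground_truths, min_count=6):
    """Keep answers that appear more than 5 times, i.e. at least 6 times."""
    counts = Counter(ground_truths)
    return sorted(a for a, c in counts.items() if c >= min_count)

# Ten annotator answers for a single (made-up) question.
annotations = ["yes"] * 6 + ["no"] * 3 + ["maybe"]
label = most_frequent_answer(annotations)

# Ground truths collected over a (made-up) set of questions.
vocab = build_answer_vocab(["yes"] * 7 + ["no"] * 5 + ["2"] * 6)
print(label, vocab)
```

Answers that fall below the frequency threshold, such as `"no"` in the toy data, are simply not available as output classes.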

Experimental Setup

Visual features of dimension 2048 are extracted from Faster R-CNN, while word features are encoded into features of dimension 1280 by the GRU. The visual features and word features are then each embedded into 512 dimensions by a fully-connected layer. Inside InterMAF, features are split into 8 attention heads with 64 dimensions each. For DyIntraMAF, the average pooled features from both modalities are transformed into 512 dimensions by an MLP followed by an element-wise sigmoid activation to obtain the conditioning gating vectors. They are then multiplied with the 512-dimensional key and query features at every region and word position to produce dynamic attention flows. Previous approaches achieve significantly better results with sentinel and relative position information. However, sentinel and relative position information do not improve the performance of our method.

All fully connected layers have the same dropout rate of 0.1. All gradients are clipped to 0.25. The batch size is set to 512. The Adamax optimizer, a variant of Adam, is used. The learning rate is set to $10^{-3}$ for the first 2 epochs and $2\times 10^{-3}$ for the next 8 epochs, and is decayed by 1/4 for the remaining epochs. Our method is implemented in PyTorch. All initializations use the PyTorch defaults.
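The learning-rate schedule can be expressed as a small function. The sketch below is an interpretation: the text does not state how often the 1/4 decay is applied after the tenth epoch, so the `decay_every` parameter and its default of one epoch are assumptions.

```python
def learning_rate(epoch, decay_every=1):
    """Learning rate for a 0-indexed epoch.

    ASSUMPTION: after the first 10 epochs the rate is multiplied by 1/4
    every `decay_every` epochs; the decay interval is not given in the text.
    """
    if epoch < 2:
        return 1e-3            # first 2 epochs
    if epoch < 10:
        return 2e-3            # next 8 epochs
    num_decays = (epoch - 10) // decay_every + 1
    return 2e-3 * 0.25 ** num_decays

schedule = [learning_rate(e) for e in range(13)]
print(schedule)
```

In a PyTorch training loop, such a function could be used to set the `lr` entry of each optimizer parameter group at the start of every epoch.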

All ablation studies are conducted on the validation set, while the train and validation sets and the extra Visual Genome dataset are combined for training when evaluating on test-dev.

Ablation study of DFAF

We perform extensive ablation studies on the VQA 2.0 validation set. The results are shown in Table 1. Our default setting has only 1 DFAF block. Region features with 2,048 dimensions are extracted from the input image by Faster R-CNN, and word features with 1,024 dimensions are extracted by the GRU. By default, all modules inside DFAF have 512 dimensions. In the final fusion layer, feature multiplication is employed, which shows a marginal improvement. Visual sentinel and bounding box position embedding are also tested; both give a slight drop in the final performance. The default setting uses 8 parallel attention heads with 64 dimensions each.

We first investigate the influence of the number of stacked DFAF blocks. The default setting has one block. As can be seen from Table 1, stacking more blocks improves the performance thanks to the residual connections. Unlike ResNet, we do not employ any normalization technique in the residual connections. The performance of a single-layer DFAF is comparable with that of BAN-12.

Then, we investigate the influence of the attention type. Bottom-up utilizes a simple attention method. Bilinear attention network proposed a bilinear attention that learns the joint attention distribution over word-region pairs. Adding InterMAF improves performance by 1% because it models the inter-modality relations between image regions and question words. Integrating only the IntraMAF module harms the performance because too many unrelated information flows hinder the learning process. By adding the dynamically conditioned DyIntraMAF module, we achieve a 2.15% performance improvement. By combining intra- and inter-modality attention flows, we significantly outperform the baseline by 2.83% and the previous state-of-the-art BAN-1 by 0.85%.

There are several orders for passing information within the InterMAF module, namely parallel and sequential. For parallel InterMAF, both region and word features are updated at the same time. For the sequential information flow, we experiment with passing attention flow from regions to words first, which updates the word features, and then passing messages from words to regions, which updates the region features, and vice versa. We denote the first sequential order as $R\rightarrow E$, $E\rightarrow R$, and the second one as $E\rightarrow R$, $R\rightarrow E$. Sequential update outperforms parallel update, while the specific order does not matter.

Next, we perform an ablation study on the influence of the embedding dimension and the cross-modal feature fusion. 512 dimensions result in better performance than 1024 dimensions. For the fusion method, multiplication shows slightly better performance than feature addition and concatenation.

A visual sentinel was utilized in many previous VQA methods and was shown to increase VQA accuracy. We treat the sentinel as a general 512-dimensional feature and concatenate it with all region and word features. The original $\mu$ region features and 14 word features thus become $\mu+1$ and 15 features, respectively. In our experiments, adding the visual sentinel does not show any improvement.

The positions of bounding boxes were widely utilized as a part of image region features in previous methods. Absolute position embedding has been employed in Transformer, BERT and Gated CNN in NLP. Relative position was adopted in Relation Network for object detection. In our experiments, adding absolute or relative positions lowers the performance.

Finally, we examine the influence of multi-head attention. We keep the overall dimension at 512 and experiment with 1, 4 and 8 attention heads. As can be seen in Table 1, 8 attention heads achieve better performance with the same number of parameters.

Visualisation of the proposed Attention Flow Weights

In Figure 3, we visualise the intra-modality attention flow weights to analyse the VQA model. The attention weights modulate information flow from contextual regions (orange, blue and green) to the center region (red). The left column shows the attention flow weights in the IntraMAF module, while the remaining columns show the dynamic attention flow weights in the DyIntraMAF module. In the DyIntraMAF module, unrelated information flows are filtered out by the question features, so the model generates the correct answer.

Comparison with State-of-the-art Methods

Table 2 compares the performance of our proposed algorithm, trained with the extra Visual Genome dataset, with state-of-the-art methods on VQA. Bottom-up in Table 2 is the winner of the VQA Challenge 2017. This approach proposed to use features based on Faster R-CNN instead of ResNet. Multi-modal Factorized High-order Pooling (MFH) is a state-of-the-art bilinear pooling method. Dense Co-Attention Network (DCN) utilizes a dense stack of multiple co-attention layers and significantly outperforms previous methods that use ResNet features. Counting methods perform well on counting questions by utilizing the information of bounding boxes. Bilinear Attention Network (BAN) is a state-of-the-art approach on VQA 2.0 that has 12 stacked blocks of BAN modules. By utilising the contextualised word embedding BERT, the performance of VQA can be further boosted.

Conclusions

In this paper, we proposed a novel framework, Dynamic Fusion with Intra- and Inter-modality Attention Flow (DFAF), for visual question answering. The DFAF framework alternately passes information within and across modalities based on inter-modality and intra-modality attention mechanisms. The information flow within the visual features is dynamically conditioned on the question features. Stacking multiple DFAF blocks is shown to improve VQA performance.

Acknowledgment

This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14208417, CUHK14239816, and in part by CUHK Direct Grant.