Multi-modality Latent Interaction Network for Visual Question Answering

Peng Gao, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, Hongsheng Li

Introduction

Visual Question Answering (VQA) has received increasing attention from the research community. Previous approaches tackle VQA by designing better features, better bilinear fusion approaches, or better attention mechanisms. Recently, relational reasoning has been explored for VQA and has significantly improved the performance and interpretability of VQA systems.

Although relational reasoning has been extensively adopted in different tasks, such as object detection, language modelling, image captioning and VQA, relational approaches for VQA have only been proposed for modelling relationships between individual words and visual regions. Such relational reasoning requires a large amount of GPU memory because it needs to model relations between every word-region pair. Moreover, for VQA, modelling relationships between individual words and visual regions alone is not enough to correctly answer the question.

To model more complex cross-modality relations, we propose a novel Multi-modality Latent Interaction Network (MLIN) with MLI modules. Different from existing relational VQA methods, the MLI module first encodes question and image features into a small number of latent visual and question summarization vectors. Each summarization vector can be formulated as a weighted pooling over visual or word features, which summarizes a certain aspect of each modality from a global perspective and therefore encodes richer information than individual word and region features. After acquiring summarizations for each modality, we establish visual-language associations between the multi-modal summarization vectors and propagate information between them to model the complex relations between language and vision. Finally, each original visual region and word feature aggregates information from the updated latent summarizations using attention mechanisms and residual connections, and the refined features are used to predict the answer.

Our proposed MLIN achieves competitive performance on VQA benchmarks, including VQA v2.0 and TDIUC. In addition, we investigate how to incorporate the pre-trained language model BERT to improve VQA models. After integrating BERT, MLIN achieves better performance than state-of-the-art models.

Our proposed MLIN is related to attention-based approaches. A comparison with previous approaches is illustrated in Figure 1. Previous attention approaches that aggregate information can be classified into the following categories: (1) The co-attention mechanism aggregates information from the other modality. (2) The Transformer aggregates information inside each modality using a key-query attention mechanism. (3) Intra- and inter-modality attention (DFAF) propagates and aggregates information within and across multiple modalities; for intra-modality feature aggregation, attention is dynamically modulated by the other modality using its pooled features. Compared with previous approaches, MLIN does not aggregate features from the large number of individual visual-word pairs but from a small number of multi-modal latent summarization vectors, which can capture high-level visual-language interactions with much smaller model capacity.

Our contributions are two-fold. (1) We propose MLIN for modelling multi-modality interactions via a small number of multi-modal summarizations, which helps encode the relationships across modalities from global perspectives and avoids capturing too many uninformative region-word relations. (2) We carry out extensive ablation studies over each component of MLIN and achieve competitive performance on the VQA v2.0 and TDIUC benchmarks. In addition, we visualize MLIN to better understand the interactions between multi-modal summarizations. We also explore how to effectively integrate a pre-trained language model into the proposed framework to further improve VQA accuracy.

Related Work

Learning good representations has been the foundation for advancing vision and Natural Language Processing (NLP) research. For computer vision, AlexNet, VGGNet, ResNet and DenseNet features achieved great success on image recognition. For NLP, word2vec, GloVe, Skip-thought, ELMo, GPT, ViLBERT and BERT achieved great success at language modelling. Successful representation learning in vision and language has greatly benefitted multi-modality feature learning. Furthermore, bottom-up & top-down features for VQA and image captioning greatly boosted the performance of multi-modality learning by exploiting additional visual region (object detection) information.

2 Relational Reasoning

Our work is most closely related to relational reasoning approaches, which try to solve VQA by learning the relationships between individual visual regions and words. Co-attention based approaches can be seen as modelling the relationship between each word and visual region pair using the attention mechanism. The Transformer uses a key-query-value attention mechanism to model the relationships inside each modality. Simple relational networks reason over all region pairs in the image by concatenating region features. Besides VQA, relational reasoning has improved performance in other research areas. It has been applied to object detection, where modelling relationships was shown to help object classification and non-maximum suppression. It has also been explored in image captioning using graph neural networks. The non-local network shows that modelling relationships across video frames can significantly boost video classification accuracy.

3 Attention-based Approaches for VQA

Attention-based approaches have been extensively studied for VQA, and many relational reasoning approaches use attention mechanisms to aggregate contextual information. Soft and hard attention were first proposed by Xu et al. and have since become mainstream in VQA systems. Yang et al. proposed to stack several layers of attention to gradually focus on the most important regions. Lu et al. proposed co-attention-based methods, which aggregate information from the other modality. Vaswani et al. aggregated information inside each modality for machine translation. Nguyen et al. proposed a densely connected co-attention mechanism for VQA. The Bilinear Attention Network generates attention weights by capturing the interactions between each feature channel. Structured attention adds a Markov Random Field (MRF) model over the spatial attention map to model spatial importance. Besides VQA, Chen et al. proposed spatial-wise and channel-wise attention mechanisms, which modulate information flow spatial-wise and channel-wise for image captioning. For referring expression grounding, Xihui et al. proposed attention-guided feature erasing.

4 Dynamic Parameter Prediction

Dynamic parameter prediction (DPP) offers another direction for multi-modality feature fusion. Noh et al. first proposed a DPP-based multi-modality fusion approach that predicts the weights of a fully connected layer from question features. Perez et al. achieved competitive VQA performance compared with complex reasoning approaches on the CLEVR dataset by predicting the normalisation parameters of visual features. Furthermore, Gao et al. proposed to modulate visual features by predicting convolution kernels from the input question, and hybrid convolution was proposed to reduce the number of parameters without hindering the overall performance. Beyond VQA, DPP-based approaches have been adopted for transfer learning between classification and segmentation.

Multi-modality Latent Interaction Network

Figure 2 illustrates the overall pipeline of our proposed Multi-modality Latent Interaction Network (MLIN). The proposed MLIN consists of a series of stacked Multi-modality Latent Interaction (MLI) modules, each of which summarizes the input visual-region and question-word information into a small number of latent summarization vectors for each modality. The key idea is to propagate visual and language information among the latent summarization vectors to model the complex cross-modality interactions from global perspectives. After information propagation among the latent summarization vectors, the visual-region and word features aggregate information from the cross-domain summarizations to update themselves. The inputs and outputs of the MLI module have the same dimensions, and the overall network stacks the MLI module for multiple stages to gradually refine the visual and language features. In the last stage, we conduct elementwise multiplication between the average features of visual regions and question words to predict the final answer.

where $\theta_{\text{RCNN}}$ and $\theta_{\text{Transformer}}$ denote the network parameters for visual and language feature encoding.

2 Modality Summarizations in MLI Module

The summarization module is shown in the Summarization part of Figure 2. After acquiring visual and question features, we add a lightweight neural network to generate $k$ sets of latent visual or language summarization vectors for each modality. The $k$ sets of linear combination weights are generated first, and the summarization vectors are then computed from them.

Each of the $k$ latent visual or language summarization vectors (i.e., each row of $\overline{R}$ or $\overline{E}$ ) is a linear combination of the input individual features, which is able to better capture high-level information compared with individual region-level or word-level features. The $k$ summarization vectors in each modality can capture $k$ different aspects of the input features from global perspectives.
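The weighted pooling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the lightweight network is stood in for by a single linear layer (`weight`, `bias`) followed by a softmax over the input items, and the toy sizes and random values are hypothetical. The exact form of the network used in the paper is not shown in this text.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(features, weight, bias):
    """Pool M feature vectors (M x d) into k summarization vectors (k x d).

    weight (k x d) and bias (k) stand in for the lightweight network.
    A single linear layer is an assumption, not the paper's exact design.
    """
    k = len(weight)
    dim = len(features[0])
    # Score every input feature for every summary slot: scores[j][i].
    scores = [
        [sum(w * x for w, x in zip(weight[j], feat)) + bias[j] for feat in features]
        for j in range(k)
    ]
    # Normalise over the M inputs so each row is a set of pooling weights.
    pooling = [softmax(row) for row in scores]
    # Each summary is a weighted sum (linear combination) of the inputs.
    summaries = [
        [
            sum(pooling[j][i] * features[i][c] for i in range(len(features)))
            for c in range(dim)
        ]
        for j in range(k)
    ]
    return pooling, summaries


rng = random.Random(0)
M, d, k = 5, 4, 3  # toy sizes: 5 regions, 4-dim features, 3 summaries
regions = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(M)]
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
b = [0.0] * k
pooling_weights, region_summaries = summarize(regions, W, b)
```

Because each row of `pooling_weights` sums to one, every summarization vector is a convex combination of the region features, which is one way to read "summarizing from a global perspective". The same function would apply unchanged to word features.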

3 Relational Learning on Multi-modality Latent Summarizations

can be considered as a latent representation that deeply encodes the cross-domain relations between the latent summarization vectors in the two modalities.
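As a rough illustration of how the $k$ visual and $k$ question summarization vectors can be paired into $k\times k$ latent vectors, the sketch below uses the element-wise product, which the ablation study names as the interaction operator of the final model. Any learned projection applied after the product, as well as the subsequent propagation step, is omitted here; the function name and toy values are hypothetical.

```python
def pairwise_interaction(visual_summaries, question_summaries):
    """Return a k x k grid where cell (i, j) is the element-wise product
    of visual summary i and question summary j."""
    return [
        [[v * q for v, q in zip(vis, que)] for que in question_summaries]
        for vis in visual_summaries
    ]


# Toy summaries: k = 3 vectors per modality, each 2-dimensional.
visual = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
question = [[2.0, 2.0], [1.0, 0.0], [-1.0, 4.0]]
latent = pairwise_interaction(visual, question)

# latent[1][2] pairs visual summary 1 with question summary 2.
print(latent[1][2])  # [-0.5, -4.0]
```

The grid has $k\times k$ entries regardless of how many regions or words the input contains, which is what keeps the later relational steps small.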

where $U_{R}\cdot\hat{A}$ and $U_{E}\cdot\hat{A}$ aggregate the information from the latent representations to obtain the updated region and word features $R_{U}$ and $E_{U}$. The feature aggregation process is illustrated in the Aggregation module in Figure 2.
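A minimal sketch of this aggregation step is given below, assuming the attention weights $U_{R}$ come from scaled dot products between each region feature and each latent vector, and that the update is residual. The learned query and key projections and the multi-head split are left out for brevity, so this is a simplification rather than the exact layer; all names and values are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features, latents):
    """Update each feature with an attention-weighted sum of latent vectors.

    features: M x d list; latents: L x d list (the k x k grid, flattened).
    Learned query/key projections are omitted (an assumption made to keep
    the sketch short); scores are plain scaled dot products.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    attention, updated = [], []
    for feat in features:
        scores = [sum(f * a for f, a in zip(feat, lat)) / scale for lat in latents]
        weights = softmax(scores)  # one row of U
        gathered = [
            sum(w * lat[c] for w, lat in zip(weights, latents)) for c in range(dim)
        ]  # the matching row of U . A_hat
        updated.append([f + g for f, g in zip(feat, gathered)])  # residual add
        attention.append(weights)
    return attention, updated


R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # 3 region features
A_hat = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.0, 0.0]]   # 4 latent vectors
U_R, R_U = aggregate(R, A_hat)
```

The output `R_U` has the same shape as `R`, so the module can be applied repeatedly, and the same function would produce `E_U` from word features.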

The input features $R,E$ and output features $R_{U},E_{U}$ of the MLI module introduced above share the same dimensions. Motivated by previous approaches, we stack MLI modules for multiple stages to recursively refine the visual and language features. After several stages of MLI modules, we average-pool the visual and word features separately and multiply the two pooled vectors element-wise for multi-modal feature fusion. A final linear classifier (with parameters $W_{cls},b_{cls}$) followed by a softmax function is adopted for answer prediction.

Accordingly, the overall system is trained in an end-to-end manner with a cross-entropy loss function.
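The fusion and prediction steps described above can be sketched as follows. The sketch assumes a single linear layer with weights `W_cls` and bias `b_cls` over the fused vector and a single ground-truth answer index for the loss; the toy features and the three-answer vocabulary are hypothetical.

```python
import math


def mean_pool(features):
    """Average a list of equal-length feature vectors column by column."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(regions, words, w_cls, b_cls):
    """Average-pool each modality, fuse by element-wise product, classify."""
    fused = [r * e for r, e in zip(mean_pool(regions), mean_pool(words))]
    logits = [
        sum(w * x for w, x in zip(row, fused)) + bias
        for row, bias in zip(w_cls, b_cls)
    ]
    return softmax(logits)


def cross_entropy(probs, target):
    """Negative log-likelihood of the ground-truth answer index."""
    return -math.log(probs[target])


R_U = [[1.0, 2.0], [3.0, 0.0]]   # refined region features, mean [2.0, 1.0]
E_U = [[0.5, 1.0], [1.5, 3.0]]   # refined word features, mean [1.0, 2.0]
W_cls = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 candidate answers
b_cls = [0.0, 0.0, 0.0]

answer_probs = predict(R_U, E_U, W_cls, b_cls)
loss = cross_entropy(answer_probs, target=0)
```

In this toy case the fused vector is `[2.0, 2.0]`, so the first two answers receive equal probability and the third is strongly suppressed.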

4 Comparison of Message Passing Complexity

In this section, we compare the message passing complexity of our approach with that of co-attention, self-attention and intra- and inter-modality attention, where $M$ denotes the number of visual regions and $N$ the number of question words. The information flow patterns are illustrated in Figure 1. For co-attention, the number of message passings is $\mathcal{O}(2\times M\times N)$ because each word computes an attention weight for each visual region and vice versa. For self-attention, the number of message passings is $\mathcal{O}(M\times M+N\times N)$. The number of message passings for intra- and inter-modality attention is the sum of those of self-attention and co-attention, $\mathcal{O}((M+N)\times(M+N))$. Generally, in bottom-up & top-down attention, 100 region proposals are used for multi-modal feature fusion. The quadratic number of message passings in self-attention and in intra- and inter-modality attention flow therefore requires a large amount of GPU memory and also hinders relational learning. In our proposed MLIN framework, the MLI module generates $k$ latent summarization vectors for each modality. After relational reasoning, $k\times k$ features are generated. In the final feature redistribution stage, $\mathcal{O}(k\times k\times N)$ message passings are performed to update the question features, and $\mathcal{O}(k\times k\times M)$ message passings are required to update the region features. The total number of message passings for our proposed MLIN in each stage is therefore $\mathcal{O}(k\times k\times(M+N))$. Our proposed multi-modality latent representations can capture multi-modality interactions with far fewer message passings while achieving competitive performance compared with DFAF. A performance comparison is given in the experiments section.
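To make these counts concrete, the short script below evaluates the four expressions for one setting. The value of 100 regions comes from the text and 6 summarization vectors is the default used in the ablation study; the question length of 14 words is an assumption chosen purely for illustration.

```python
def message_passings(num_regions, num_words, num_summaries):
    """Evaluate the per-stage message passing counts given in the text."""
    m, n, k = num_regions, num_words, num_summaries
    return {
        "co-attention": 2 * m * n,
        "self-attention": m * m + n * n,
        "intra- and inter-modality": (m + n) ** 2,
        "MLIN": k * k * (m + n),
    }


# M = 100 regions (from the text), N = 14 words (assumed), k = 6 summaries.
counts = message_passings(num_regions=100, num_words=14, num_summaries=6)
for name, value in counts.items():
    print(f"{name}: {value}")
# co-attention: 2800
# self-attention: 10196
# intra- and inter-modality: 12996
# MLIN: 4104
```

Under these assumed sizes, MLIN needs roughly a third of the message passings of intra- and inter-modality attention. The MLIN count grows linearly in $M+N$ for a fixed $k$, whereas the self-attention and intra- and inter-modality counts grow quadratically.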

Experiments

We conduct experiments on the VQA v2.0 and TDIUC datasets. Both contain question-image pairs built from Microsoft COCO images with annotated questions. VQA v2.0 is an updated version of VQA v1.0 that reduces data bias. VQA v2.0 contains train, validation and test-standard splits, and 25% of the test-standard split serves as the test-dev set. Performance evaluation on VQA v2.0 reports accuracy on different question types (Yes/No, Number and Other) as well as overall accuracy. The train, validation and test sets contain 82,743, 40,504 and 81,434 images, with 443,757, 214,354 and 447,793 questions, respectively. We carry out extensive ablation studies on the validation set of VQA v2.0 with models trained on the train split. We also report final performance on the VQA v2.0 test set with models trained on the combination of the train and validation sets, which is the common practice of most previous approaches listed in Table 2. Although VQA v2.0 has been commonly adopted as the most important benchmark for VQA, Kafle et al. found that performance on VQA v2.0 is dominated by simple questions, which makes it difficult to compare different approaches. To address the bias problem in VQA v2.0, TDIUC collects 1.6 million questions divided into 12 categories.

2 Experimental Setup

3 Ablation Study on VQA2 Validation

We carry out extensive ablation studies to evaluate the effectiveness of each module in our proposed MLIN; the results are reported in Table 1. The default setting is a one-stage MLIN in which all features are transformed to a dimension of 512. We create 6 summarizations for each modality. For the key-query attention module used in feature aggregation, we adopt 12-head multi-head attention with each head computing 128-dimensional features. In the ablation study, we examine the influence of the number of stacked MLI modules, the number of latent summarization vectors, latent interaction, latent propagation, feature aggregation and the final feature fusion operator.

Similar to BAN and DFAF, we stack the proposed MLI module 5 and 8 times, denoted as MLIN-5 and MLIN-8, for multi-stage reasoning. We observe that deeper networks improve performance and can be optimized by SGD thanks to the residual connections.

We then study the influence of the number of question and visual summarization vectors. Too few summarization vectors are unable to capture different aspects of the input, which deteriorates the overall performance. Too many summarization vectors require too much GPU memory and computation for only marginal improvement. We choose 6 question summarization vectors and 6 visual summarization vectors as a trade-off between performance and computation.

For the interaction operator that creates paired summarization vectors, we compare element-wise product, element-wise addition and bilinear fusion (MUTAN) for multi-modality summarization fusion. Bilinear fusion gives the best performance. However, we choose element-wise product in our final model for the overall simplicity and efficiency of the network design. In contrast to our approach, the Simple Relational Reasoning Network uses concatenation by default.

To simplify hyper-parameter selection, we set all layers to the same dimension. Extracted visual and question features are transformed into this common dimension by a linear transform. A dimension of 1024 leads to better performance than 512. However, stacking multiple MLI modules leads to a larger improvement than making the network wider. Our final model uses 512 dimensions by default.

There are several ways to propagate information among the latent paired summarization vectors. Self-attention uses key-query attention to aggregate information from the other latent summarizations, while dual attention aggregates information inside and outside each feature vector simultaneously using self-attention. In our experiments, our proposed relational propagation operations (Equations 7, 8 and 9) achieve better performance than the more complicated dual attention.

After acquiring latent interaction features, the original question and visual features gather information from the latent vectors to complete multi-modality relational learning. We tested two approaches for gathering features from the latent vectors. In the first, key-query attention between the visual and word features and the latent vectors is used to perform weighted pooling of the latent summarization vectors. In the second, motivated by the dynamic attention weight prediction network, we use the transpose of the attention weights from the summarization stage to gather information from the latent summarization vectors. The key-query attention approach outperforms dynamic attention weight prediction.

Further hyper-parameters in the feature aggregation stage are the number of attention heads and the head dimension. We keep the dimension of each head at 128 and test 8, 12 and 16 parallel attention heads. The features obtained from the different heads are concatenated to form the final features.

Language models have been actively investigated in NLP-related tasks and can generate features that better capture the meaning of language. BERT is a language model pre-trained by predicting randomly masked words and by predicting whether one sentence follows another. As can be seen from the table, fine-tuning the MLIN+BERT model with the BERT learning rate set to 1/10 of the main learning rate lets the model benefit most from BERT.

4 Comparison with State of the art methods

In this section, we compare our proposed MLIN with previous state-of-the-art methods on the VQA v2.0 and TDIUC datasets in Tables 2 and 3. Following previous methods, we report results on the VQA v2.0 test set for models trained on the train and validation splits with Visual Genome augmentation.

On VQA v2.0, we divide previous approaches into non-relational and relational approaches, which are two orthogonal research directions that can complement each other. The Bottom-Up-Top-Down (BUTD) approach uses object detection features in a simple attention module to answer the question about the input image. MFH is the state-of-the-art bilinear fusion approach; by switching from residual features to bottom-up-top-down features, it achieves better accuracy. BAN proposed a bilinear attention mechanism that generates multi-modality attention using information from each channel, and it won first place in the single-model task of the VQA competition 2018.

Besides feature fusion, relational reasoning has received much attention for solving VQA. DCN proposed a densely connected co-attention module for cross-modality feature learning. Relation prior creates $<subject,predicate,object>$ triples for VQA reasoning. Conditional Graph builds a graph among all region proposals and conditions this graph on the visual question. Although Conditional Graph is less competitive than other approaches, the interpretation it provides is quite useful for diagnosing VQA problems. Counter focuses on the counting questions of VQA by utilising the relative positions between bounding boxes to learn efficient Non-Maximum Suppression (NMS). DFAF is a multi-layer stacked network that combines intra- and inter-modality information flow for feature fusion. Furthermore, DFAF can dynamically modulate the intra-modality information flow using the average-pooled features from the other modality. MLIN uses 100 region proposals for fair comparison.

VQA v2.0 has been widely adopted as the most important benchmark in VQA. Since VQA v2.0 is dominated by simple samples, which makes it hard to discriminate between different methods, we also compare with other approaches on the TDIUC dataset. QTA is the state-of-the-art method on TDIUC; it proposed question-type-guided attention with both bottom-up-top-down features and residual features. Our proposed MLIN achieves better performance even with bottom-up-top-down features only. Our method also outperforms DFAF on this dataset.

5 Visualization

We visualize the attention weights of the summarization vectors in Figure 3 and observe that different summarization vectors serve different functions, each focusing on different global information. The first set of attention weights collects information from the background, the second focuses on the regions most important for answering the question, and the third performs weighted pooling over regions that interact strongly with each other for answering the question.

Conclusion

In this paper, we proposed a novel MLIN for exploring relationships to solve VQA. Inside MLIN, multi-modality reasoning is realized through the process of Summarization, Interaction, Propagation and Aggregation. MLIN can be stacked for several layers for better relational reasoning. Our method achieves competitive performance on benchmark VQA datasets with far fewer message passings. Furthermore, we show that a good pre-trained language model as the question encoder is important for VQA performance.

Acknowledgements

This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14208417 and CUHK14239816, and in part by the CUHK Direct Grant.