Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering

Zhou Yu, Jun Yu, Jianping Fan, Dacheng Tao

Introduction

Thanks to recent advances in computer vision and natural language processing, computers are expected to be able to automatically understand the semantics of images and natural languages in the near future. Such advances have also stimulated new research topics like image-text retrieval, image captioning, and visual question answering.

Compared with image-text retrieval and image captioning (which only require the underlying algorithms to search for or generate a free-form text description of a given image), visual question answering (VQA) is a more challenging task: it requires a fine-grained understanding of the semantics of both the image and the question, and it must support complex reasoning to predict the best-matching answer. In some respects, the VQA task can be treated as a generalization of image captioning and image-text retrieval. Thus, building effective VQA algorithms that approach human-level performance is an important step towards enabling artificial intelligence in general.

Existing VQA approaches usually have three stages: (1) representing the images as visual features and questions as textual features; (2) combining these multi-modal features to obtain fused image-question features; (3) using the integrated image-question features to learn a multi-class classifier and to predict the best-matching answer. Because deep neural networks (DNNs) are effective and flexible, many existing approaches model the three stages in one DNN and train it in an end-to-end fashion through back-propagation. Of the three stages, feature representation and multi-modal feature fusion particularly affect VQA performance.

With respect to multi-modal feature fusion, most existing approaches simply use linear models (e.g., concatenation or element-wise addition) to integrate the image’s visual feature with the question’s textual feature. Since multi-modal feature distributions may vary dramatically, the integrated image-question representations obtained by such linear models may not be sufficiently expressive to fully capture complex associations between the visual features from images and the textual features from questions. In contrast to linear pooling, bilinear pooling has recently been used to integrate different CNN features for fine-grained image recognition. However, the high dimensionality of the output features and the huge number of model parameters may seriously limit the applicability of bilinear pooling. Fukui et al. proposed the Multi-modal Compact Bilinear (MCB) pooling model to effectively and simultaneously reduce the number of parameters and computation time using the Tensor Sketch algorithm. Using the MCB model, the group proposed a network architecture for the VQA task and won the VQA challenge 2016. Nevertheless, the MCB model relies on a high-dimensional output feature to guarantee robust performance, which may limit its applicability due to huge memory usage. To overcome this problem, Kim et al. proposed the Multi-modal Low-rank Bilinear (MLB) pooling model based on the Hadamard product of two feature vectors. Since MLB generates output features with lower dimensions and models with fewer parameters, it is highly competitive with MCB. However, MLB has a slow convergence rate and is sensitive to the choice of hyper-parameters. To address these issues, here we develop the Multi-modal Factorized Bilinear pooling (MFB) method, which enjoys the dual benefits of the compact output features of MLB and the robust expressive capacity of MCB.

With respect to feature representation, directly using global features for image representation may introduce noisy information that is irrelevant to the given question. Therefore, it is intuitive to introduce a visual attention mechanism into the VQA task to adaptively learn the most relevant image regions for a given question. Modeling visual attention may significantly improve performance. However, most existing approaches only model image attention without considering question attention, even though question attention is also very important: questions expressed in natural language may contain colloquialisms that can be regarded as noise. Therefore, based on our MFB approach, we design a deep network architecture for the VQA task using a co-attention learning module to jointly learn both image and question attentions.

To summarize, the main contributions of this study are as follows. First, we develop a simple but effective Multi-modal Factorized Bilinear pooling (MFB) approach to fuse the visual features from images with the textual features from questions. MFB significantly outperforms existing multi-modal bilinear pooling approaches such as MCB and MLB. Second, based on the MFB module, a co-attention learning architecture is designed to jointly learn both image and question attention. Our MFB model with co-attention achieves state-of-the-art performance on the VQA dataset. We also conduct detailed and extensive experiments to show why our MFB approach is effective. Our experimental results demonstrate that normalization techniques are extremely important in bilinear models.

Related Work

In this section, we briefly review the most relevant research on VQA, especially those studies that use multi-modal bilinear models.

Malinowski et al. made an early attempt at solving the VQA task. Since then, solving the VQA task has received increasing attention from the computer vision and natural language processing communities. VQA approaches can be classified into the following methodological categories: the coarse joint-embedding models, the fine-grained joint-embedding models with attention, and the external knowledge based models.

The coarse joint-embedding models are the most straightforward VQA solutions. Image and question are first represented as global features and then integrated to predict the answer. Zhou et al. proposed a baseline approach for the VQA task by using the concatenation of the image CNN features and the question BoW (bag-of-words) features, with a linear classifier learned to predict the answer. Some approaches introduce more complex deep models, e.g., LSTM networks or residual networks, to tackle the VQA task in an end-to-end fashion.

One limitation of coarse joint-embedding models is that their global features may contain noisy information, making it hard to correctly answer fine-grained problems (e.g., “what color are the cat’s eyes?”). Therefore, recent VQA approaches introduce the visual attention mechanism into the VQA task by adaptively learning the local fine-grained image features for a given question. Chen et al. proposed a “question-guided attention map” that projects the question embeddings to the visual space and formulates a configurable convolutional kernel to search the image attention region. Yang et al. proposed a stacked attention network to learn the attention iteratively. Some approaches introduce off-the-shelf object detectors or object proposals as the attention region candidates and then use the question to identify related ones. Fukui et al. proposed multi-modal compact bilinear pooling to integrate image features from spatial grids with textual features from the questions to predict the attention. In addition, some approaches apply attention learning to both the images and questions. Lu et al. proposed a co-attention learning framework to alternately learn the image attention and the question attention. Nam et al. proposed a multi-stage co-attention learning framework to refine the attentions based on memory of previous attentions.

Although joint-embedding models for VQA deliver impressive performance, they are not good enough at answering problems that require complex reasoning or common sense knowledge. Therefore, introducing external knowledge is beneficial for VQA. However, existing approaches have either only been applied to specific datasets, or have been ineffective on benchmark datasets. There is room for further exploration and development.

Multi-modal Bilinear Models for VQA

Multi-modal feature fusion plays an important and fundamental role in VQA. After the image and question features are obtained, concatenation or element-wise summation is most frequently used for multi-modal feature fusion. Since the distributions of two feature sets in different modalities (i.e., the visual features from images and the textual features from questions) may vary significantly, the representation capacity of the fused features may be insufficient, limiting the final prediction performance.

Fukui et al. first introduced the bilinear model to solve the problem of multi-modal feature fusion in VQA. In contrast to the aforementioned approaches, they proposed the Multi-modal Compact Bilinear pooling (MCB), which uses the outer product of two feature vectors to produce a very high-dimensional feature for quadratic expansion. To reduce the computational cost, they used a sampling-based approximation approach that exploits the property that the projection of the outer product of two vectors can be represented as the convolution of their individual projections. The MCB model outperformed the simple fusion approaches and demonstrated superior performance on the VQA dataset. Nevertheless, MCB usually needs high-dimensional features (e.g., 16,000-D) to guarantee robust performance, which may seriously limit its applicability due to limitations in GPU memory.
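The convolution property used by MCB can be illustrated with a small NumPy sketch. It assumes the projection is a Count Sketch (each input index is hashed to one output index with a random sign), which is how Tensor Sketch is usually described. The hash indices `hx`/`hy`, the signs `sx`/`sy`, the function names, and the toy dimensions are illustrative assumptions, not values from Fukui et al.

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Project v into d dimensions: out[h[i]] += s[i] * v[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)  # unbuffered add handles repeated hash indices
    return out

def compact_bilinear(x, y, hx, sx, hy, sy, d):
    """Sketch of the outer product of x and y, computed without forming it."""
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    # Element-wise product in the frequency domain = circular convolution.
    return np.fft.irfft(fx * fy, n=d)

rng = np.random.default_rng(0)
m, n, d = 6, 4, 8  # toy sizes; the text mentions outputs around 16,000-D
x, y = rng.standard_normal(m), rng.standard_normal(n)
hx, hy = rng.integers(0, d, m), rng.integers(0, d, n)
sx, sy = rng.choice([-1.0, 1.0], m), rng.choice([-1.0, 1.0], n)
z = compact_bilinear(x, y, hx, sx, hy, sy, d)
```

The result `z` equals the Count Sketch of the full outer product under the combined hash `(hx[i] + hy[j]) mod d` and sign `sx[i] * sy[j]`, so the `m × n` outer product never has to be materialized.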

Multi-modal Factorized Bilinear Pooling

Inspired by the matrix factorization tricks for uni-modal data, the projection matrix $W_{i}$ in Eq.(2) can be factorized as two low-rank matrices:
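For reference, the factorization can be written out as follows. This is a reconstruction under stated assumptions rather than the original typeset equation: Eq.(2) is taken to be the bilinear form $z_{i}=x^{T}W_{i}y$ with image feature $x\in\mathbb{R}^{m}$ and question feature $y\in\mathbb{R}^{n}$, and the factor matrices are written as $U_{i}\in\mathbb{R}^{m\times k}$ and $V_{i}\in\mathbb{R}^{n\times k}$. The symbols $m$, $n$, $U_{i}$, $V_{i}$, $u_{d}$, and $v_{d}$ are assumed notation.

```latex
z_{i} = x^{T} W_{i}\, y
      = x^{T} U_{i} V_{i}^{T} y
      = \sum_{d=1}^{k} \left( x^{T} u_{d} \right) \left( v_{d}^{T} y \right)
      = \mathbb{1}^{T} \left( U_{i}^{T} x \circ V_{i}^{T} y \right)
```

Here $u_{d}$ and $v_{d}$ denote the $d$-th columns of $U_{i}$ and $V_{i}$, $\circ$ is the element-wise product, and $\mathbb{1}\in\mathbb{R}^{k}$ is the all-ones vector, so the final form is a sum over $k$ element-wise products. With $k=1$ a single product term remains for each output, which is the rank-1 case.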

Relationship to MLB. Eq.(4) shows that the MLB in Eq.(1) is a special case of the proposed MFB with $k=1$, which corresponds to the rank-1 factorization. Figuratively speaking, MFB can be decomposed into two stages (see Figure 1(b)). First, the features from different modalities are expanded to a high-dimensional space and then integrated with element-wise multiplication. After that, sum pooling followed by the normalization layers is performed to squeeze the high-dimensional feature into the compact output feature. MLB, in contrast, directly projects the features to the low-dimensional output space and performs element-wise multiplication. Therefore, with the same dimensionality for the output features, the representation capacity of MFB is more powerful than that of MLB.
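The two stages can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the layout of the projection matrices `U` and `V`, the toy sizes, and the specific normalization (a signed square root followed by L2 normalization) are assumptions, since the text here only says "normalization layers".

```python
import numpy as np

def mfb(x, y, U, V, k, eps=1e-12):
    """Two-stage MFB sketch: expand and multiply, then squeeze by sum pooling.

    x: (m,) image feature, y: (n,) question feature,
    U: (m, k*o) and V: (n, k*o) projection matrices (assumed layout).
    """
    # Stage 1: expand both modalities to k*o dimensions and fuse element-wise.
    joint = (U.T @ x) * (V.T @ y)
    # Stage 2: sum pooling over non-overlapping windows of size k -> (o,).
    z = joint.reshape(-1, k).sum(axis=1)
    # Normalization layers (assumed here: signed square root, then L2).
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + eps)

rng = np.random.default_rng(0)
m, n, k, o = 8, 6, 5, 4  # toy sizes, not the paper's settings
x, y = rng.standard_normal(m), rng.standard_normal(n)
U, V = rng.standard_normal((m, k * o)), rng.standard_normal((n, k * o))
z = mfb(x, y, U, V, k)   # compact fused feature of length o
```

Calling `mfb` with `k=1` skips the pooling entirely and leaves a plain element-wise product of two projections, which is the MLB-style special case described above.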

Network Architectures for VQA

The goal of the VQA task is to answer a question about an image. The inputs to the model are an image and a corresponding question about the image. Our model extracts both the image and the question representations, integrates the multi-modal features using the MFB module in Figure 1(b), treats each individual answer as one class, and performs multi-class classification to predict the correct answer. In this section, two network architectures are introduced. The first is the MFB baseline with one MFB module, which is used to perform ablation analysis with different hyper-parameters for comparison with other baseline approaches. The second network introduces co-attention learning, which jointly learns the image and question attentions to better capture fine-grained correlations between the image and the question and may lead to a model with better representation capability.

The extracted image and question features are fed to the MFB module to generate the fused feature $z$. Finally, $z$ is fed to an $N$-way classifier with the KL-divergence loss. All the weights except those of the ResNet (excluded due to the limitation of GPU memory) are optimized jointly in an end-to-end manner. The whole network architecture is illustrated in Figure 2.
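As a rough illustration of the loss, the sketch below computes the KL divergence between a target answer distribution and the softmax output of an $N$-way classifier for one sample. How the target distribution is built is not specified in this section; the soft label used here is an assumed toy example, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def kl_divergence_loss(logits, target, eps=1e-12):
    """KL(target || softmax(logits)) for one sample."""
    pred = softmax(logits)
    mask = target > 0  # entries with zero target probability contribute 0
    return float(np.sum(target[mask] * (np.log(target[mask]) - np.log(pred[mask] + eps))))

# Toy example with N = 4 candidate answers; the target is an assumed soft label.
logits = np.array([2.0, 0.5, -1.0, 0.0])
target = np.array([0.7, 0.3, 0.0, 0.0])
loss = kl_divergence_loss(logits, target)
```

The loss is zero when the predicted distribution matches the target and grows as probability mass moves to the wrong answers.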

MFB with Co-Attention

For a given image, different questions could result in entirely different answers. Therefore, an image attention model, which can predict the relevance of each spatial grid to the question, is beneficial for predicting the accurate answer. In the MCB-based attention model, 14 $\times$ 14 (196) image spatial grids (res5c feature maps in ResNet) are used to represent the input image. After that, the question feature is merged with each of the 196 image features using MCB, followed by some feature transformations (e.g., 1 $\times$ 1 convolution and ReLU activation) and softmax normalization to predict the attention weight for each grid location. Based on the attention map, the attentional image features are obtained by the weighted sum of the spatial grid vectors. Multiple attention maps are generated to enhance the learned attention map, and these attention maps are concatenated to output the attentional image features. Finally, the attentional image features are merged with the question features using MCB to determine the final answer prediction.
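The softmax normalization and weighted sum over grid locations can be sketched as follows. The relevance scores are random placeholders standing in for the output of the fusion and convolution layers, and the channel size `c` and the number of attention maps `A` are toy assumptions.

```python
import numpy as np

def attend(grid_feats, scores):
    """grid_feats: (G, c) grid vectors; scores: (A, G) unnormalized relevance
    for A attention maps. Returns the maps and the concatenated features."""
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)  # softmax over the G grid locations
    feats = w @ grid_feats             # (A, c): one weighted sum per map
    return w, feats.reshape(-1)        # concatenate the A attended features

rng = np.random.default_rng(0)
G, c, A = 14 * 14, 16, 2               # 196 grids; c and A are toy values
grid_feats = rng.standard_normal((G, c))
scores = rng.standard_normal((A, G))   # placeholder for fusion + conv output
weights, attended = attend(grid_feats, scores)
```

Each row of `weights` is one attention map that sums to 1, and `attended` has length `A * c` because the per-map features are concatenated.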

From the reported results of that model, one can see that incorporating an attention mechanism allows the model to effectively learn which region is important for the question, clearly contributing to better performance than the model without attention. However, that attention model only focuses on learning image attention while completely ignoring question attention. Since the questions are expressed in natural language, the contribution of each word is significantly different. Therefore, here we develop a co-attention learning approach (see Figure 3) to jointly learn both the question and image attentions.

The difference between the network architecture of our co-attention model and the MCB-based attention model is that we additionally place a question attention module after the LSTM networks to learn the attention weights of every word in the question. Unlike other co-attention models for VQA, in our model the image and question modules are loosely coupled, such that we do not exploit the image features when learning the question attention module. This is because we assume that the network can directly infer the question attention (i.e., the key words of the question) without seeing the image, as humans do. We name this network MFB with Co-Attention (MFB+CoAtt).
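The loose coupling can be made concrete with a short sketch in which the word weights are computed from the question alone. The scoring vector `w_q` is a hypothetical stand-in for whatever learned transformation the question attention module applies, and the per-word features are random placeholders for LSTM outputs.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def question_attention(word_feats, w_q):
    """Weights depend only on the words; no image feature is used."""
    alpha = softmax(word_feats @ w_q)  # one attention weight per word
    return alpha, alpha @ word_feats   # attended question feature

rng = np.random.default_rng(0)
T, d = 6, 8                                # toy: 6 words, 8-D word features
word_feats = rng.standard_normal((T, d))   # placeholder for LSTM outputs
w_q = rng.standard_normal(d)               # hypothetical scoring vector
alpha, q_att = question_attention(word_feats, w_q)
```

Because `question_attention` takes no image argument, the same question always yields the same word weights regardless of the image it is paired with.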

Experiments

In this section, we conduct several experiments to evaluate the performance of our MFB models on the VQA task using the VQA dataset. We first perform ablation analysis on the MFB baseline model to verify the efficacy of the proposed approach over existing state-of-the-art methods such as MCB and MLB. We then provide detailed analyses of the reasons why our MFB model outperforms its counterparts. Finally, we choose the optimal hyper-parameters for the MFB module and train the model with co-attention (MFB+CoAtt) for fair comparison with other state-of-the-art approaches on the VQA dataset.

The VQA dataset consists of approximately 200,000 images from the MS-COCO dataset, with 3 questions per image and 10 answers per question. The dataset is split into three parts: train (80k images and 248k questions), val (40k images and 122k questions), and test (80k images and 244k questions). Additionally, there is a 25$\%$ subset of the test split named test-dev. Two tasks are provided to evaluate performance: Open-Ended (OE) and Multiple-Choice (MC). We use the tools provided by Antol et al. to evaluate the performance on the two tasks.
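For orientation, the accuracy rule commonly stated for this dataset counts a predicted answer as fully correct when at least 3 of the 10 annotators gave it. The sketch below implements only that rule; it is a simplification, since the official evaluation tools also normalize answer strings before matching, and the function name and toy answers are assumptions.

```python
def vqa_accuracy(predicted, human_answers):
    """min(#annotators who gave `predicted` / 3, 1) for one question."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Ten toy human answers for a single question.
answers = ["red"] * 6 + ["dark red"] * 2 + ["maroon", "brown"]
score_red = vqa_accuracy("red", answers)
score_dark = vqa_accuracy("dark red", answers)
```

An answer given by only two annotators therefore receives partial credit of 2/3 rather than being counted as wrong.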

Experimental Setup

For the VQA dataset, we use the Adam solver with $\beta_{1}=0.9$, $\beta_{2}=0.99$. The base learning rate is set to 0.0007 and decays every 40,000 iterations using an exponential rate of 0.5. We terminate training at 100,000 iterations (200,000 iterations if the training set is augmented with the large-scale Visual Genome dataset). Dropout is used after each LSTM layer (dropout ratio $p=0.3$) and after the MFB module ($p=0.1$), as in prior work. The number of answers is $N=3000$. For all experiments (except for the ones shown in Table 2, which use the train and val splits together as the training set like the comparative approaches), we train on the train split, validate on the val split, and report the results on the test split. Because the submission attempts for the test-standard split are strictly limited, we evaluate most of our settings on the test-dev split and only report the best results on the test-standard split. The batch size is set to 200 for the models without the attention mechanism, and set to 64 for the models with attention (due to GPU memory limitation). All experiments are implemented with the Caffe toolbox and performed on a workstation with GTX 1080 GPUs.
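The learning-rate schedule can be written as a one-line function. The sketch assumes a staircase decay, in which the rate is halved once every 40,000 iterations rather than decayed continuously; the text does not say which of the two is used, and the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.0007, decay_rate=0.5, step=40_000):
    """Staircase decay: multiply by decay_rate once every `step` iterations."""
    return base_lr * decay_rate ** (iteration // step)

# Rates at a few points of a 100,000-iteration run.
schedule = {it: learning_rate(it) for it in (0, 39_999, 40_000, 80_000, 99_999)}
```

Under this assumption a 100,000-iteration run uses three rates: 0.0007, then 0.00035 from iteration 40,000, then 0.000175 from iteration 80,000.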

Ablation Analysis

First, MFB significantly outperforms MCB and MLB. With 5/6 of the parameters, MFB($k=5,o=1000$) achieves about a 1$\%$ accuracy improvement compared with MCB. Moreover, with only 1/3 of the parameters, MFB($k=5,o=200$) obtains similar results to MCB. These characteristics allow us to train our model on a memory-limited GPU with a larger batch size. Furthermore, the validation accuracy of MCB suffers from overfitting with the high-dimensional output features. In comparison, the performance of our MFB model is relatively robust.

Second, when $ko$ is fixed to a constant, e.g., 5000, the number of factors $k$ affects the performance. Increasing $k$ from 1 to 5 produces a 0.5$\%$ performance gain. When $k=10$, the performance has approached saturation. This phenomenon can be explained by the fact that a large $k$ corresponds to using a large window to sum pool the features, which can be treated as a compressed representation and may lose some information. When $k$ is fixed, increasing $o$ does not produce further improvements. This suggests that high-dimensional output features may be easier to overfit. Similar results have been reported in prior work. In summary, $k=5$ and $o=1000$ may be a suitable combination for our MFB model on the VQA dataset, so we use these settings in our follow-up experiments.

Comparison with State-of-the-art

Table 2 compares our approaches with the current state-of-the-art. The table is split into four parts over the rows: the first summarizes the methods without the attention mechanism; the second includes the methods with attention; the third shows the results of approaches with external pre-trained word embedding models, e.g., GloVe or Skip-thought Vectors (StV); and the last includes the models additionally trained with the external large-scale Visual Genome dataset. To best utilize model capacity, the training data set is augmented so that both the train and val splits are used as the training set, resulting in about a $1\%\sim 2\%$ overall accuracy improvement on the OE task. Also, to better understand the question semantics, pre-trained GloVe word vectors are concatenated with the learned word embedding. The MFB model corresponds to the MFB baseline model. The MFB+Att model indicates the model that replaces the MCB with our MFB in the MCB+Att model. The MFB+CoAtt model represents the network shown in Figure 3.

From Table 2, we have the following observations:

First, the model with MFB outperforms other comparative approaches significantly. The MFB baseline outperforms all other existing approaches without the attention mechanism for both the OE and MC tasks, and even surpasses some approaches with attention. When attention is introduced, MFB+Att consistently outperforms the current next-best model, MCB+Att, highlighting the efficacy and robustness of the proposed MFB.

Second, the co-attention model further improves the performance over the attention model that considers only image attention. By introducing co-attention learning, MFB+CoAtt delivers a 0.5$\%$ improvement on the OE task compared with the MFB+Att model in terms of overall accuracy, indicating the additional benefits of the co-attention learning framework.

Finally, with the external pre-trained GloVe model and the Visual Genome dataset, the performance of our models is further improved. The MFB+CoAtt+GloVe+VG model significantly outperforms the best reported results with a single model on both the OE and MC tasks.

In Table 3, we compare our approach with the state-of-the-art methods that use model ensembles. Similar to prior work, we train 7 individual MFB+CoAtt+GloVe models and average their prediction scores. Four of the seven models additionally introduce the Visual Genome dataset into the training set. For fair comparison, only published results are reported. From Table 3, the ensemble of MFB models outperforms the next best approach by 1.5$\%$ on the OE task and by 2.2$\%$ on the MC task. Finally, compared with the results obtained by humans, there is still a lot of room for improvement before human-level performance is reached.
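Averaging the prediction scores of several models can be sketched as follows. The score matrix is a toy example with 3 models and 4 candidate answers rather than the 7 models and 3000 answers used in the experiments, and the function name is an assumption.

```python
import numpy as np

def ensemble_predict(scores):
    """scores: (num_models, num_answers) per-model prediction scores."""
    mean_scores = np.mean(scores, axis=0)  # average over the models
    return int(np.argmax(mean_scores)), mean_scores

# Toy example: each row is one model's score distribution over 4 answers.
scores = np.array([[0.6, 0.3, 0.1, 0.0],
                   [0.2, 0.5, 0.2, 0.1],
                   [0.3, 0.4, 0.2, 0.1]])
best, mean_scores = ensemble_predict(scores)
```

In this toy case the first model alone would pick answer 0, but the averaged scores favor answer 1, which shows how the ensemble can overrule a single confident model.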

To better demonstrate the effects of co-attention learning, in Figure 5 we visualize the learned question and image attentions of some examples from the validation set. The examples are randomly picked from different question types. It can be seen that the learned question and image attentions usually focus closely on the key words and the most relevant image regions. From the incorrect examples, we can also draw conclusions about the weaknesses of our approach, which are perhaps common to all VQA approaches: 1) some key words in the question are neglected by the question attention module, which seriously affects the learned image attention and final predictions (e.g., the word catcher in the first example and the word bottom in the third example); 2) even when the intention of the question is well understood, some visual contents are still unrecognized (e.g., the flags in the second example) or misclassified (the meat in the fourth example), leading to the wrong answer for the counting problem. These observations are useful to guide further improvements for VQA in the future.

Conclusions

In this paper, we develop a Multi-modal Factorized Bilinear pooling (MFB) approach to fuse multi-modal features for the VQA task. Compared with existing bilinear pooling methods, the MFB approach achieves a significant performance improvement on the VQA task. Based on MFB, we design a network architecture with co-attention learning that achieves new state-of-the-art performance on the real-world VQA dataset. These explorations of multi-modal bilinear pooling and co-attention learning are applicable to a wide range of tasks involving multi-modal data.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 61622205, 61602136 and 61472110, the Zhejiang Provincial Natural Science Foundation of China under Grant LR15F020002, and the Australian Research Council under Projects FL-170100117, DP-140102164, and LP-150100671.