Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Duy-Kien Nguyen, Takayuki Okatani

Introduction

There has been significant progress in the study of visual question answering (VQA) in the short time since its introduction, with rapid performance gains on common benchmark datasets. This progress has mainly been brought about by two lines of research: the development of better attention mechanisms and the improvement of the fusion of features extracted from an input image and question.

Since its introduction by Bahdanau et al., attention has played an important role in solutions to various problems in artificial intelligence, ranging from tasks using a single modality (e.g., language, speech, and vision) to multimodal tasks. For VQA, attention on image regions generated from the input question was introduced first, and several extensions have since been proposed. Meanwhile, researchers have proposed several methods for feature fusion, where the aim is to obtain a better fused representation of image and question pairs. These studies updated the state of the art for common benchmark datasets at the time of each publication.

We observe that these two lines of research have been independently conducted so far. This is particularly the case with the studies of feature fusion methods, where attention is considered to be optional, even though the best performance is achieved with it. However, we think that they are rather two different approaches towards the same goal. In particular, we argue that a better attention mechanism leads to a better fused representation of image-question pairs.

Motivated by this, we propose a novel co-attention mechanism for improved fusion of visual and language representations. Given representations of an image and a question, it first generates an attention map on image regions for each question word and an attention map on question words for each image region. It then computes attended features, concatenates the multimodal representations, and transforms them by a single-layer network with ReLU and a residual connection. These computations are encapsulated into a composite network that we call the dense co-attention layer, since it considers every interaction between any image region and any question word. The layer has a fully symmetric architecture between the two modalities, and can be stacked to form a hierarchy that enables multi-step interactions between the image-question pair.

Starting from initial representations of an input image and question, each dense co-attention layer in the layer stack updates the representations, which are then passed to the next layer. The final outputs of the stack are fed to a layer for answer prediction. We use additional attention mechanisms in the initial feature extraction as well as in the answer prediction layer. We call the entire network, including all these components, the dense co-attention network (DCN). We show the effectiveness of DCNs through several experimental results; they achieve a new state of the art on the VQA 1.0 and 2.0 datasets.

Related Work

In this section, we briefly review previous studies of VQA with a special focus on the developments of attention mechanisms and fusion methods.

Attention has proved its effectiveness on many tasks, and VQA is no exception. A number of methods have been developed so far, in which question-guided attention on image regions is commonly used. They fall into two classes according to the type of image features employed. One is the class of methods that use visual features from region proposals, which are generated by Edge Boxes or a Region Proposal Network. The other is the class of methods that use convolutional features (i.e., activations of convolutional layers).

There are several approaches to the creation and use of attention maps. Yang et al. developed the stacked attention network, which produces multiple attention maps on the image in a sequential manner, aiming at performing multiple steps of reasoning. Kim et al. extended this idea by incorporating it into a residual architecture to produce better attention information. Chen et al. proposed a structured attention model that can encode cross-region relations, aiming at properly answering questions that involve complex inter-region relations.

Earlier studies mainly considered question-guided attention on image regions. In later studies, the opposite orientation of attention, i.e., image-guided attention on question words, is considered additionally. Lu et al. introduced the co-attention mechanism that generates and uses attention on image regions and on question words. To reduce the gap of image and question features, Yu et al. utilized attention to extract not only spatial information but also language concept of the image. Yu et al. combined the mechanism with a novel multi-modal feature fusion of image and question.

We point out that the existing attention mechanisms consider only a limited number of the possible interactions between image regions and question words. Some consider only attention on image regions from a whole question. Co-attention additionally considers attention on question words, but it is created from a whole image. We argue that this can be a significant limitation of the existing approaches. The proposed mechanism can deal with every interaction between any image region and any question word, which may make it possible to model unknown complex image-question relations that are necessary for correctly answering questions.

2.2 Multimodal Feature Fusion

The common framework of existing methods is that visual and language features are independently extracted from the image and question at the initial step, and they are fused at a later step to compute the final prediction. In early studies, researchers employed simple fusion methods such as the concatenation, summation, and element-wise product of the visual and language features, which are fed to fully connected layers to predict answers.

It was then shown by Fukui et al. that a more complicated fusion method does improve prediction accuracy; they introduced the bilinear (pooling) method that uses an outer product of two vectors of visual and language features for their fusion. As the outer product gives a very high-dimensional feature, they adopted the idea of Gao et al. to compress the fused feature and named it the Multimodal Compact Bilinear (MCB) pooling method. Because the compacted feature of the MCB method still needs to be high-dimensional to guarantee robust performance, Kim et al. proposed low-rank bilinear pooling using the Hadamard product of two feature vectors, which is called Multimodal Low-rank Bilinear (MLB) pooling. Pointing out that MLB suffers from a slow convergence rate, Yu et al. proposed Multi-modal Factorized Bilinear (MFB) pooling, which computes a fused feature with a matrix factorization trick to reduce the number of parameters and improve the convergence rate.

The attention mechanisms can also be considered feature fusion methods, regardless of whether it is explicitly mentioned, since they are designed to obtain a better representation of image-question pairs based on their interactions. This is particularly the case with co-attention mechanisms in which the two features are treated symmetrically. Our dense co-attention network is based on this observation. It fuses the two features by multiple applications of the attention mechanism that can use more fine-grained interactions between them.

Dense Co-Attention Network (DCN)

In this section, we describe the architecture of DCNs; see Fig.1 for an overview. A DCN consists of a stack of dense co-attention layers that fuse language and visual features repeatedly, on top of which sits an answer prediction layer that predicts answers in a multi-label classification setting. We first explain the initial feature extraction from the input question and image (Sec.3.1) and then describe the dense co-attention layer (Sec.3.2) and the answer prediction layer (Sec.3.3).

We employ pretrained networks that are commonly used in previous studies for encoding or extracting features from images, questions, and answers, such as pretrained ResNet with some differences from earlier studies.

We use a bi-directional LSTM for encoding questions and answers. Specifically, a question consisting of $N$ words is first converted into a sequence $\{e^{Q}_{1},\ldots,e^{Q}_{N}\}$ of GloVe vectors, which are then fed into a one-layer bi-directional LSTM (Bi-LSTM) with a residual connection as

We follow a similar procedure to encode answers. An answer of $M$ words is converted into $\{e^{A}_{1},\ldots,e^{A}_{M}\}$ and then fed to the same Bi-LSTM, yielding the hidden states $\overrightarrow{a_{m}}$ and $\overleftarrow{a_{m}}$ $(m=1,\ldots,M)$. We will use $s_{A}=[\overrightarrow{a_{M}}^{\top},\overleftarrow{a_{1}}^{\top}]^{\top}$ for answer representation.
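The whole-question representation $s_{Q}$ is used later in the text. By analogy with the answer encoding above, a plausible form is the following, where $\overrightarrow{q_{n}}$ and $\overleftarrow{q_{n}}$ are assumed names for the forward and backward Bi-LSTM hidden states at the $n$-th question word. This is an inference from the definition of $s_{A}$ and from the later statement that $Q_{l}$ is a $d\times N$ matrix, not a formula quoted from the text.

```latex
% Assumed by analogy with s_A; the residual connection of the Bi-LSTM is omitted here.
q_{n} = [\overrightarrow{q_{n}}^{\top}, \overleftarrow{q_{n}}^{\top}]^{\top}, \qquad
Q = [q_{1}, \ldots, q_{N}], \qquad
s_{Q} = [\overrightarrow{q_{N}}^{\top}, \overleftarrow{q_{1}}^{\top}]^{\top}.
```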

3.1.2 Image Representation

As in many previous studies, we use a pretrained CNN (i.e., a ResNet with 152 layers pretrained on ImageNet) to extract visual features of multiple image regions, but our extraction method is slightly different. We extract features from four conv. layers and then use a question-guided attention on these layers to fuse their features. We do this to exploit the maximum potential of the subsequent dense co-attention layers. We conjecture that features at different levels in the hierarchy of visual representation will be necessary to correctly answer a wide range of questions.

To be specific, we extract outputs from the four conv. layers (after ReLU) before the last four pooling layers. These are tensors of different sizes (i.e., $256\times 112\times 112$, $512\times 56\times 56$, $1024\times 28\times 28$, and $2048\times 14\times 14$) and are converted into tensors of the same size $(d\times 14\times 14)$ by applying max pooling with a different pooling size and one-by-one convolution to each. We also apply $l_{2}$ normalization on the depth dimension of each tensor, as in previous work. We reshape the normalized tensors into four $d\times T$ matrices, where $T=14\times 14$.
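The resizing and normalization steps can be illustrated with a small standard-library sketch. It assumes non-overlapping max pooling whose window equals the downsampling factor (8, 4, 2, and 1 for the four resolutions); the text only says that a different pooling size is used per layer, so these values are an assumption, and the one-by-one convolution is omitted. The function names are hypothetical.

```python
import math

def max_pool2d(fmap, k):
    """Non-overlapping k x k max pooling on a 2-D list whose sides are divisible by k."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, w, k)]
            for i in range(0, h, k)]

def l2_normalize_depth(tensor, eps=1e-12):
    """tensor[c][i][j]: divide the depth vector at each location (i, j) by its l2 norm."""
    c, h, w = len(tensor), len(tensor[0]), len(tensor[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            norm = math.sqrt(sum(tensor[ch][i][j] ** 2 for ch in range(c))) + eps
            for ch in range(c):
                out[ch][i][j] = tensor[ch][i][j] / norm
    return out

# Assumed pooling window per input side length so that every map becomes 14 x 14.
pool_sizes = {side: side // 14 for side in (112, 56, 28, 14)}
```

After pooling, each $d\times 14\times 14$ tensor would be flattened over its spatial dimensions to give a $d\times T$ matrix with $T=196$.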

Next, attention on the four layers is created from $s_{Q}$, the representation of the whole question defined above. We use a two-layer neural network having 724 hidden units with ReLU non-linearity to project $s_{Q}$ to the scores of the four layers as

which are then normalized by softmax to obtain four attention weights $\alpha_{1},\ldots,\alpha_{4}$. The weighted sum of the above four matrices is computed, yielding a $d\times T$ matrix $V=[v_{1},\ldots,v_{T}]$, which is our representation of the input image. It stores the image feature at the $t$-th image region in its $t$-th column vector of size $d$.
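The layer-attention step described above amounts to a softmax over four scores followed by a weighted sum of four equally sized matrices. A minimal sketch follows; the scores are taken as given (in the paper they come from the two-layer network applied to $s_{Q}$), and `fuse_layers` is a hypothetical name.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_feats, scores):
    """layer_feats: list of d x T matrices (lists of rows); scores: one scalar per layer.

    Returns the d x T weighted sum V and the attention weights alpha.
    """
    alphas = softmax(scores)
    d, t = len(layer_feats[0]), len(layer_feats[0][0])
    fused = [[sum(a * f[i][j] for a, f in zip(alphas, layer_feats))
              for j in range(t)]
             for i in range(d)]
    return fused, alphas
```

With equal scores the four weights are all 0.25 and the result is the plain average of the four matrices; a question that pushes one score up shifts the image representation towards that layer.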

3.2 Dense Co-Attention Layer

The proposed architecture has the following properties. First, it is a co-attention mechanism. Second, the co-attention is dense in the sense that it considers every interaction between any word and any region. To be specific, our mechanism creates one attention map on regions per word and one attention map on words per region (see Fig.3). Third, it can be stacked as shown in Fig.1.

3.2.2 Dense Co-attention Mechanism

For the sake of explanation, we first explain the basic method for creating attention maps, which we will extend later. Given $Q_{l}$ and $V_{l}$, two attention maps are created as shown in Fig.3. Their computation starts with the affinity matrix

where $W_{l}$ is a learnable weight matrix. We normalize $A_{l}$ row-wise to derive attention maps on question words conditioned by each image region as

and also normalize $A_{l}$ column-wise to derive attention maps on image regions conditioned by each question word as

Note that each row of $A_{Q_{l}}$ and $A_{V_{l}}$ contains a single attention map.
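The basic computation described in this subsection can be sketched as follows. The sketch assumes that the affinity matrix has the bilinear form $A_{l}=V_{l}^{\top}W_{l}Q_{l}$ (a $T\times N$ matrix, which is what the row-wise and column-wise descriptions imply) and that the normalization is a softmax; both are assumptions made for illustration, and the helper names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def coattention_maps(V, Q, W):
    """V: d x T image features, Q: d x N word features, W: d x d weights.

    Returns A_Q (T x N, one map over words per region) and
    A_V (N x T, one map over regions per word).
    """
    A = matmul(matmul(transpose(V), W), Q)           # T x N affinity matrix
    A_Q = [softmax(row) for row in A]                # normalize each row of A
    A_V = [softmax(row) for row in transpose(A)]     # normalize each column of A
    return A_Q, A_V

# Toy example: d = 2, T = 3 regions, N = 2 words, identity weight matrix.
V = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
Q = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
A_Q, A_V = coattention_maps(V, Q, W)
```

In the toy example the first region is most similar to the first word, so the first row of `A_Q` puts more weight on that word, while the third region is equally similar to both words and receives a uniform map.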

In several studies, multiple attention maps are created and applied to target features in parallel, which provides multiple attended features that are then fused by concatenation. In one such approach, features are first linearly projected to multiple lower-dimensional spaces, and the above attention function is performed in each of them. We adopt a similar approach that uses multiple attention maps here, but we use averaging instead of concatenation to fuse the multiple attended features, because we found it works better in our case.

Attention maps are created from each affinity matrix by column-wise and row-wise normalization as

As we employ multiplicative (or dot-product) attention as explained below, average fusion of multiple attended features is equivalent to averaging our attention maps as

respectively. (The notation $\mathtt{(1\!:\!T,:)}$ indicates the submatrix consisting of the first $T$ rows, as in Python.) Note that $\hat{Q}_{l}$ is the same size as $V_{l}$ (i.e., $d\times T$) and $\hat{V}_{l}$ is the same size as $Q_{l}$ (i.e., $d\times N$).
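The claim that averaging attended features is equivalent to averaging attention maps follows from linearity, and can be checked numerically. The sketch below assumes that an attended feature matrix is the product of the feature matrix with the transposed attention map, and that every map is applied to the same features; it leaves out any augmentation of the features with extra rows, so no row slicing is needed. Names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attend(features, attn):
    """features: d x N; attn: T x N with rows summing to 1. Returns a d x T matrix."""
    return matmul(features, transpose(attn))

def average(mats):
    """Element-wise mean of equally sized matrices."""
    h = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / h for j in range(cols)] for i in range(rows)]

Q = [[1.0, 2.0], [3.0, 4.0]]                     # d = 2, N = 2
maps = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],    # two attention maps, each T x N with T = 3
        [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]]

avg_of_attended = average([attend(Q, a) for a in maps])   # attend first, then average
attended_with_avg = attend(Q, average(maps))              # average the maps, then attend
```

Both routes give the same $d\times T$ matrix, which matches the size stated above for $\hat{Q}_{l}$.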

3.2.3 Fusing Image and Question Representations

After computing the attended feature representations $\hat{Q}_{l}$ and $\hat{V}_{l}$, we fuse the image and question representations, as shown in the right half of Fig.2. The matrix $\hat{V}_{l}$ stores in its $n$-th column the attended representation of the entire image conditioned on the $n$-th question word. Then, the $n$-th column vector $\hat{v}_{ln}$ is fused with the representation $q_{ln}$ of the $n$-th question word by concatenation to form the $2d$-vector $[q_{ln}^{\top},\hat{v}_{ln}^{\top}]^{\top}$. This concatenated vector is projected back to a $d$-dimensional space by a single-layer network followed by the ReLU activation and a residual connection as

Similarly, the representation $v_{lt}$ of the $t$-th image region is concatenated with the representation $\hat{q}_{lt}$ of the whole question conditioned on the $t$-th image region, and then projected back to a $d$-dimensional space as

It should be noted that the above two fully-connected networks have different parameters (i.e., $W_{Q_{l}}$, $W_{V_{l}}$, etc.) for each layer $l$.
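The fusion step described above (concatenate, project with a single layer, apply ReLU, add the residual) can be sketched for a single word or region vector as follows. The bias term `b` and the exact placement of the residual are assumptions consistent with the description; `fuse` is a hypothetical name.

```python
def fuse(x, x_hat, W, b):
    """Fuse a feature x with its attended counterpart x_hat.

    x, x_hat: length-d lists; W: d x 2d weight matrix; b: length-d bias.
    Returns ReLU(W [x; x_hat] + b) + x, a length-d list.
    """
    z = x + x_hat                                   # list concatenation gives a 2d-vector
    proj = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    return [max(0.0, p) + xi for p, xi in zip(proj, x)]

# Toy example with d = 2.
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, -1.0, 0.0, -1.0]]
b = [0.0, 0.0]
q_next = fuse([1.0, 2.0], [3.0, 4.0], W, b)
```

The same function would be applied to every column of $Q_{l}$ (paired with the matching column of $\hat{V}_{l}$) with one parameter set, and to every column of $V_{l}$ (paired with $\hat{Q}_{l}$) with another, so the output sizes match the inputs and layers can be stacked.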

3.3 Answer Prediction

Given the final outputs $Q_{L}$ and $V_{L}$ of the last dense co-attention layer, we predict answers. As they contain the representations of $N$ question words and $T$ image regions, we first apply a self-attention function to each of them to obtain aggregated representations of the whole question and image. This is done for $Q_{L}$ as follows: i) compute ‘scores’ $s_{q_{L1}},\ldots,s_{q_{LN}}$ of $q_{L1},\ldots,q_{LN}$ by applying an identical two-layer MLP with ReLU nonlinearity in its hidden layer; ii) apply softmax to them to derive attention weights $\alpha^{Q}_{1},\ldots,\alpha^{Q}_{N}$; and iii) compute an aggregated representation by $s_{Q_{L}}=\sum_{n=1}^{N}\alpha^{Q}_{n}q_{Ln}$. Following the same procedure with an MLP with different weights, we derive attention weights $\alpha^{V}_{1},\ldots,\alpha^{V}_{T}$ and then compute an aggregated representation $s_{V_{L}}$ from $v_{L1},\ldots,v_{LT}$.
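Steps i) to iii) can be sketched as a small attention-pooling routine. The MLP parameters (`W1`, `b1`, `w2`, `b2`) and function names are hypothetical placeholders; the hidden size and initialization are not specified here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_score(x, W1, b1, w2, b2):
    """Two-layer MLP with a ReLU hidden layer, mapping a vector to a scalar score."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def attention_pool(columns, W1, b1, w2, b2):
    """columns: list of length-d vectors (e.g. q_L1..q_LN). Returns (pooled vector, weights)."""
    scores = [mlp_score(c, W1, b1, w2, b2) for c in columns]   # i) score each vector
    alphas = softmax(scores)                                   # ii) normalize the scores
    d = len(columns[0])
    pooled = [sum(a * c[i] for a, c in zip(alphas, columns)) for i in range(d)]  # iii) weighted sum
    return pooled, alphas
```

The same routine, with a separate set of MLP weights, would produce $s_{V_{L}}$ from the columns of $V_{L}$.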

Using $s_{Q_{L}}$ and $s_{V_{L}}$ thus computed, we predict answers. We consider three methods to do this here. The first one is to compute the inner product between the sum of $s_{Q_{L}}$ and $s_{V_{L}}$ and $s_{A}$, the answer representation defined in Sec.3.1.1, as

where $\sigma$ is the logistic function and $W$ is a learnable weight matrix. The second and third ones are to use an MLP for computing scores for a set of predefined answers, which is a widely used approach in recent studies. The two differ in how to fuse $s_{Q_{L}}$ and $s_{V_{L}}$, i.e., summation

where MLP is a two-layer MLP having 1024 hidden units with ReLU non-linearity. The first one is the most flexible, as it allows us to deal with any answers that are not considered at the time of training the entire network.
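The first scoring method can be sketched as below. It assumes the score has the form $\sigma(s_{A}^{\top}W(s_{Q_{L}}+s_{V_{L}}))$; the text specifies an inner product with $s_{A}$, a logistic function, and a weight matrix $W$, but the exact placement of $W$ is an assumption here, and `answer_score` is a hypothetical name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def answer_score(s_A, s_Q, s_V, W):
    """Score one candidate answer.

    s_Q, s_V: length-d aggregated question and image vectors.
    W: matrix mapping the fused d-vector to the dimension of s_A.
    s_A: answer representation.
    """
    fused = [q + v for q, v in zip(s_Q, s_V)]                      # sum of the two modalities
    projected = [sum(w * f for w, f in zip(row, fused)) for row in W]
    return sigmoid(sum(a * p for a, p in zip(s_A, projected)))     # inner product, then logistic
```

Because the answer enters only through its representation $s_{A}$, this form can score answers that were never part of a fixed output vocabulary, which is the flexibility noted above.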

Experiments

In this section, we present results of the experiments conducted to evaluate the proposed method.

We used the two most popular datasets, VQA and VQA 2.0, for our experiments. VQA (also known as VQA 1.0) contains human-annotated question-answer pairs on 204,721 images from the Microsoft COCO dataset. There are three predefined splits of questions, train, val, and test (or test-standard), which consist of 248,349, 121,512, and 244,302 questions, respectively. There is also a $25\%$ subset of the test-standard set referred to as test-dev. All the questions are categorized into three types: yes/no, number, and other. Each question has 10 free-response answers. VQA 2.0 is an updated version of VQA 1.0 and is the largest as of now. Compared with VQA 1.0, it contains more samples (443,757 train, 214,354 val, and 447,793 test questions), and is more balanced in terms of language bias. We evaluate our models on the challenging Open-Ended task of both datasets.

Following previous work, we choose correct answers appearing more than 5 times for VQA and 8 times for VQA 2.0 to form the set of candidate answers. Following previous studies, we train our network on the train + val splits and report the test-dev and test-standard results from the VQA evaluation server (except for the ablation test shown below). We use the standard VQA evaluation protocol in all the experiments.

4.2 Experimental Setup

For both datasets, we use the Adam optimizer with the parameters $\alpha=0.001$, $\beta_{1}=0.9$, and $\beta_{2}=0.99$. During training, we make the learning rate $(\alpha)$ decay every 4 epochs for VQA and every 7 epochs for VQA 2.0 with an exponential rate of 0.5. All models are trained for up to 16 and 21 epochs on VQA and VQA 2.0, respectively. To prevent overfitting, dropout is used after each fully connected layer with a dropout ratio $p=0.3$ and after the LSTM with a dropout ratio $p=0.1$. The batch size is set to 160 and 320 for VQA and VQA 2.0, respectively. We set the dimension $d$ of the feature space in the dense co-attention layers (equivalently, the size of its hidden layers) to be 1024.

4.3 Ablation Study

The architecture of the proposed DCN is composed of multiple modules. To evaluate the contribution of each module to final prediction accuracy, we conducted ablation tests. Using VQA 2.0, we evaluated several versions of DCNs with different parameters and settings by training them on the train split and measuring their performance on the val split. The results are shown in Table 1.

The first block of the table shows the effects of image-question co-attention. The numbers are performances obtained by a DCN with only question-guided attention on images $(I\leftarrow Q)$, with only image-guided attention on question words $(I\rightarrow Q)$, and the standard DCN with co-attention $(I\leftrightarrow Q)$. The single-direction variants generate attention in only one of the two paths in the dense co-attention layer; the rest of the computations remain the same. The network with co-attention performs the best, verifying the effectiveness of our co-attention implementation.

The second block of the table shows the impact of $K$, which is the row size of $M_{Q_{l}}$ and $M_{V_{l}}$ that are used for augmenting $Q_{l}$ and $V_{l}$. This augmentation was originally introduced to be able to deal with “nowhere to attend”, which can be implemented by $K=1$. However, we found that the use of $K>1$ improves performance to a certain extent, which we think is because $M_{Q_{l}}$ and $M_{V_{l}}$ work as external memory that can be used through the attention mechanism. As shown in the table, $K=3$ yields the best performance.

The third and fourth blocks of the table show choices of the number $h$ of parallel attention maps and the number $L$ of stacked layers. The best result was obtained for $h=4$ and $L=3$.

The last two blocks of the table show the effects of the use of attention in the answer prediction layer and the image extraction layer; the use of attention improves accuracy by about $1.3\%$ and $0.5\%$, respectively.

4.4 Comparison with Existing Methods

Table 2 shows the performance of our method on VQA 1.0 along with published results of others. The entries ‘DCN ($n$)’ indicate which method for score computation is employed from (16)-(18). It is seen from the table that our method outperforms the best published result (MF-SIG-T3) by a large margin of $0.9\%\sim 1.1\%$ on both the test-dev and test-standard sets. Furthermore, the improvements can be seen in all of the entries (Other with $1.1\%$, Number with $3.4\%$, Yes/No with $0.6\%$ on the test-standard set), implying the capacity of DCNs to model multiple types of complex relations between question-image pairs. Notably, we achieve significant improvements of $3.0\%$ and $3.4\%$ for the question type Number on the test-dev and test-standard sets, respectively.

Table 2 also shows the performances of DCNs with a different answer prediction layer that uses (16), (17), and (18) for score computation. It is seen that (17) shows at least comparable performance to the others and even attains the best performance of $67.02\%$ on the test-standard set.

Table 3 compares our method with previously published results on VQA 2.0, and with the winner of the VQA 2.0 Challenge 2017, on both the test-dev and test-standard sets. Our approach outperforms the state-of-the-art published method (MF-SIG-T3) by a large margin of $2.1\%$ on the test-dev set, even though the MF-SIG-T3 model was trained with VQA 2.0 and an augmented dataset (Visual Genome). The improvements are seen in all the question types (Other with $1.71\%$, Number with $3.66\%$, and Yes/No with $2.41\%$). Compared with the winner of the VQA 2.0 Challenge 2017, the Adelaide model, our best DCN (17) delivers improvements of $1.5\%$ and $1.37\%$ over Adelaide+Detector on test-dev and test-standard, respectively, with gains in every question type. It is worth pointing out that the winning method uses a detector (Region Proposal Network) trained on annotated regions of the Visual Genome dataset to extract visual features, and that the model is also trained using an external dataset, i.e., the Visual Genome question answering dataset.

It should also be noted that, while achieving the best performance on the VQA dataset, the size of the DCNs (i.e., the number of parameters) is comparable to or even smaller than that of the former state-of-the-art methods, as shown in Table 4.

4.5 Qualitative Evaluation

Complementary image-question pairs are available in VQA 2.0, which are pairs of the same question and different images with different answers. To understand the behaviour of the trained DCN, we visualize attention maps that the DCN generates for some of the complementary image-question pairs. Specifically, we show the multiplication of an input image and question with their attention maps $\alpha^{V}_{1},\ldots,\alpha^{V}_{T}$ and $\alpha^{Q}_{1},\ldots,\alpha^{Q}_{N}$ (defined in Sec.3.3) generated in the answer prediction layer. A typical example is shown in Fig.4. Each row shows the results for two pairs of the same question and different images, from which we can observe that the DCN is able to look at the right regions to find the correct answers. The first column shows the results for two pairs of the same image and different questions. It is observed that the DCN focuses on relevant image regions and question words to produce answers correctly. More visualization results, including failure cases, are provided in the supplementary material.

Conclusion

In this paper, we present a novel network architecture for VQA named the dense co-attention network. The core of the network is the dense co-attention layer, which is designed to enable improved fusion of visual and language representations by considering dense symmetric interactions between the input image and question. The layer can be stacked to perform multi-step image-question interactions. The layer stack, combined with the initial feature extraction step and the final answer prediction layer, both of which have their own attention mechanisms, forms the dense co-attention network. The experimental results on two datasets, VQA and VQA 2.0, confirm the effectiveness of the proposed architecture.

Acknowledgement

This work was partly supported by JSPS KAKENHI Grant Number JP15H05919, JST CREST Grant Number JPMJCR14D1, Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (Infrastructure Maintenance, Renovation and Management), and the ImPACT Program “Tough Robotics Challenge” of the Council for Science, Technology, and Innovation (Cabinet Office, Government of Japan).

References

Appendix A More Details of the Experimental Setups

In our experiments, images and questions are preprocessed as follows. All the images were resized to $448\times 448$ before being fed into the CNN. All the questions were tokenized using the Python Natural Language Toolkit (nltk). We used the vocabulary provided by the CommonCrawl-840B GloVe model for English word vectors, and set out-of-vocabulary words to unk. As mentioned in the main paper, we chose the correct answers appearing more than 5 times (= 3,014 answers) for VQA 1.0, and more than 8 times (= 3,113 answers) for VQA 2.0, as in previous work. We capped the maximum length of questions at 14 words and then performed dynamic unrolling for each question to allow for questions of different lengths.
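The text-side preprocessing can be illustrated with a short sketch. To stay self-contained it uses a plain whitespace split as a stand-in for the nltk tokenizer, and it lowercases the question, which is an assumption not stated above; `encode_question` and `candidate_answers` are hypothetical names.

```python
from collections import Counter

MAX_QUESTION_LEN = 14  # maximum number of words kept per question

def encode_question(question, vocab, max_len=MAX_QUESTION_LEN):
    """Tokenize, truncate to max_len words, and map out-of-vocabulary words to 'unk'."""
    tokens = question.lower().rstrip("?").split()   # stand-in for nltk tokenization
    tokens = tokens[:max_len]
    return [tok if tok in vocab else "unk" for tok in tokens]

def candidate_answers(answers, min_count):
    """Keep the answers that appear more than min_count times in the training answers."""
    counts = Counter(answers)
    return {a for a, c in counts.items() if c > min_count}
```

For example, `candidate_answers(train_answers, 5)` would correspond to the VQA 1.0 setting and `candidate_answers(train_answers, 8)` to the VQA 2.0 setting, where `train_answers` is a hypothetical list of ground-truth answer strings.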

Throughout the experiments, we used three-layer DCNs, that is, DCNs with three dense co-attention layers (L = 3). This number of layers was chosen based on our preliminary experiments. The Bi-LSTM was initialized following a recommendation from previous work, and all the other parameters were initialized as suggested by Glorot et al. In the training procedure, the Adam optimizer was used to train our model for 16 and 21 epochs on VQA and VQA 2.0 with batch sizes of 160 and 320, respectively; weight decay with a rate of 0.0001 was added. We used exponential decay to gradually decrease the learning rate as

where the initial learning rate $\alpha$ was set to $\alpha=0.001$, and the number of decay epochs was set to 4 and 7 for VQA and VQA 2.0, respectively; we set $\beta_{1}=0.9$ and $\beta_{2}=0.99$.
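A minimal sketch of such a schedule is given below. It assumes the continuous form $\alpha\cdot 0.5^{\,\mathrm{epoch}/\mathrm{decay\ epochs}}$, which halves the rate every `decay_epochs` epochs; a staircase variant that only changes the rate at multiples of `decay_epochs` would also match the description, so the exact form is an assumption. `learning_rate` is a hypothetical name.

```python
def learning_rate(epoch, decay_epochs, base_lr=0.001, rate=0.5):
    """Exponentially decayed learning rate: base_lr * rate ** (epoch / decay_epochs)."""
    return base_lr * rate ** (epoch / decay_epochs)

# Learning rate at the start of each epoch for the two settings described above.
vqa1_schedule = [learning_rate(e, decay_epochs=4) for e in range(16)]
vqa2_schedule = [learning_rate(e, decay_epochs=7) for e in range(21)]
```

Under this assumption the rate for VQA is 0.001 at epoch 0, 0.0005 at epoch 4, and 0.00025 at epoch 8.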

Appendix B Effects of the Employment of Contextualized Word Vectors

To extract word features from input questions, some of the previous studies employed pretrained RNNs (specifically, GRU networks pre-trained with Skip-thought). In this study, we initially pursued a similar approach; we fine-tuned a pretrained LSTM, specifically a two-layer Bi-LSTM trained as a CoVe (Context Vector) encoder. After conducting comparative experiments, we eventually employed a single-layer Bi-LSTM with random initialization, as explained in the main paper. We report here the results of the experiments.

Table 5 shows the performances of DCNs with the CoVe-pretrained Bi-LSTM and with the randomly initialized Bi-LSTM. Note that the former is a two-layer model and the latter has only one layer. Here, the VQA 2.0 test-dev dataset was used. It is observed that for DCNs with the answer prediction layer of (16), the one with the CoVe-pretrained model performs slightly better than the one with the randomly initialized model, but the difference is small. For DCNs with the answer prediction layers of (17) and (18), the one with the randomly initialized model performs better with fewer parameters.

It should be noted, however, that the employment of CoVe-pretrained models, together with the answer prediction layer of (16), makes it possible to compute a meaningful answer representation $(s_{A})$ for answers that have not been seen before, i.e., those that are not included in the training data. Table 6 shows the results of DCN (16) with the CoVe-pretrained model for Multiple Choice answers, which include many unseen answers. This is not the case with DCNs (17) and (18), which compute scores for a fixed set of predetermined answers, the common approach of most recent studies.

Appendix C Visualization of Attention Maps in the Answer Prediction Layer

We have shown a few examples of attention maps generated in the answer prediction layer of our DCNs in Fig.4 of the main paper. We show here more examples for success cases (Sec.C.1) and also for failure cases (Sec.C.2).

We consider the visualization of complementary pairs to analyze the behaviour of our DCNs. Each row shows a complementary pair having the same question and different images. It can be seen from the examples shown below that the image and question attention maps are generated appropriately for most of the success cases.

C.2 Failure Cases

According to our analysis, failure cases can be categorized into the following four types:

1. Although the DCN is able to locate appropriate image regions and words, it fails to distinguish two different objects or concepts that have a similar appearance. This may be because the extracted image features are not rich enough to distinguish them (e.g., mutt and lab; spoon and fork).

2. Although the DCN is able to locate appropriate image regions and words, it fails to yield correct answers due to the bias of the dataset or missing instances of some objects/concepts in the dataset. For example, there are many samples of an American flag but no sample of a dragon flag in the training set.

3. The DCN fails to locate appropriate image regions. This tends to occur when some image regions have a similar appearance to the region that the DCN should attend to, or when the region that it should attend to is too small.

4. Although the DCN does yield conceptually correct answers, they are not listed in the given set of answers in the dataset and are thus judged incorrect. For instance, while the given correct answer is water, the DCN outputs beach, which should also be correct, as in one of the examples below.

As in the above success cases, each row shows a complementary pair having the same question and different images. In each row, at least one of the two has an erroneous prediction. The red bounding boxes indicate erroneous answers and the green ones indicate correct answers. The numbers in the failure examples indicate the error types categorized above.

Appendix D Layer Attention in the Image Feature Extraction Step

As explained in the main paper (Sec.3.1), our DCN extracts visual features from an input image using a pre-trained ResNet at the initial step. The features are obtained by computing the weighted sum of the activations (i.e., outputs) of the four convolutional layers of the ResNet, using attention weights generated conditioned on the input question. We examine here how this attention mechanism works for different types of questions. Specifically, utilizing the fifty-five question types provided in VQA 2.0, we compute the mean and standard deviation of the four attention weights for the questions belonging to each question type. We used all the questions in the validation set and our DCN trained only on the train set for this computation.
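The per-type statistics described above can be computed with a few lines of standard-library Python. The record format and the use of the population standard deviation are assumptions made for illustration; `layer_weight_stats` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean, pstdev

def layer_weight_stats(records):
    """records: iterable of (question_type, [alpha_1, alpha_2, alpha_3, alpha_4]).

    Returns {question_type: (per-layer means, per-layer standard deviations)}.
    """
    by_type = defaultdict(list)
    for qtype, weights in records:
        by_type[qtype].append(weights)
    stats = {}
    for qtype, rows in by_type.items():
        per_layer = list(zip(*rows))               # regroup the weights layer by layer
        stats[qtype] = ([mean(col) for col in per_layer],
                        [pstdev(col) for col in per_layer])
    return stats
```

Each record would hold the four softmax weights that the layer-attention network produces for one validation question, keyed by that question's type.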

Figure 6 shows the results. The bars in four colors represent the means of the four layer weights for each question type, and the thin black bars attached to the color bars represent their standard deviations. The fifty-five question types are ordered by their similarity along the horizontal axis. From the plot, we can make the following observations:

Layer 1 (the lowest one) has a certain level of weights only for Yes/No questions (shown on about the left half of the plots) and no weight for other types of questions (on the right half);

Layer 2 has a small weight only for Yes/No questions and no weight for other types of questions;

Layer 3 tends to have large weights for questions about colors (e.g., “what color”) and questions about presence of a given object(s) (e.g., “are there” and “how many”);

Layer 4 (the highest one) has the largest attention weights in most of the question types, indicating its importance in answering them.

Specific questions, such as “what color” and “what sport is”, tend to have smaller standard deviations than nonspecific questions, such as “is the woman” and “do you”.
