Exploring Models and Data for Image Question Answering

Mengye Ren, Ryan Kiros, Richard Zemel

Introduction

Combining image understanding and natural language interaction is one of the grand dreams of artificial intelligence. We are interested in the problem of jointly learning image and text through a question-answering task. Recently, researchers studying image caption generation have developed powerful methods of jointly learning from image and text inputs to form higher level representations from models such as convolutional neural networks (CNNs) trained on object recognition, and word embeddings trained on large scale text corpora. Image QA involves an extra layer of interaction between human and computers. Here the model needs to pay attention to details of the image instead of describing it in a vague sense. The problem also combines many computer vision sub-problems such as image labeling and object detection.

In this paper we present our contributions to the problem: a generic end-to-end QA model using visual semantic embeddings to connect a CNN and a recurrent neural net (RNN), as well as comparisons to a suite of other models; an automatic question generation algorithm that converts description sentences into questions; and a new QA dataset (COCO-QA) that was generated using the algorithm, and a number of baseline results on this new dataset.

In this work we assume that the answers consist of only a single word, which allows us to treat the problem as a classification problem. This also makes the evaluation of the models easier and more robust, avoiding the thorny evaluation issues that plague multi-word generation problems.

Related Work

Malinowski and Fritz released a dataset with images and question-answer pairs, the DAtaset for QUestion Answering on Real-world images (DAQUAR). All images are from the NYU depth v2 dataset , and are taken from indoor scenes. Human segmentation, image depth values, and object labeling are available in the dataset. The QA data has two sets of configurations, which differ by the number of object classes appearing in the questions (37-class and 894-class). There are mainly three types of questions in this dataset: object type, object color, and number of objects. Some questions are easy but many questions are very hard to answer even for humans. Since DAQUAR is the only publicly available image-based QA dataset, it is one of our benchmarks to evaluate our models.

Together with the release of the DAQUAR dataset, Malinowski and Fritz presented an approach which combines semantic parsing and image segmentation. Their approach is notable as one of the first attempts at image QA, but it has a number of limitations. First, a human-defined possible set of predicates are very dataset-specific. To obtain the predicates, their algorithm also depends on the accuracy of the image segmentation algorithm and image depth information. Second, their model needs to compute all possible spatial relations in the training images. Even though the model limits this to the nearest neighbors of the test images, it could still be an expensive operation in larger datasets. Lastly the accuracy of their model is not very strong. We show below that some simple baselines perform better.

Very recently there has been a number of parallel efforts on both creating datasets and proposing new models . Both Antol et al. and Gao et al. used MS-COCO images and created an open domain dataset with human generated questions and answers. In Anto et al.’s work, the authors also included cartoon pictures besides real images. Some questions require logical reasoning in order to answer correctly. Both Malinowski et al. and Gao et al. use recurrent networks to encode the sentence and output the answer. Whereas Malinowski et al. use a single network to handle both encoding and decoding, Gao et al. used two networks, a separate encoder and decoder. Lastly, bilingual (Chinese and English) versions of the QA dataset are available in Gao et al.’s work. Ma et al. use CNNs to both extract image features and sentence features, and fuse the features together with another multi-modal CNN.

Our approach is developed independently from the work above. Similar to the work of Malinowski et al. and Gao et al., we also experimented with recurrent networks to consume the sequential question input. Unlike Gao et al., we formulate the task as a classification problem, as there is no single well- accepted metric to evaluate sentence-form answer accuracy . Thus, we place more focus on a limited domain of questions that can be answered with one word. We also formulate and evaluate a range of other algorithms, that utilize various representations drawn from the question and image, on these datasets.

Proposed Methodology

The methodology presented here is two-fold. On the model side we develop and apply various forms of neural networks and visual-semantic embeddings on this task, and on the dataset side we propose new ways of synthesizing QA pairs from currently available image description datasets.

In recent years, recurrent neural networks (RNNs) have enjoyed some successes in the field of natural language processing (NLP). Long short-term memory (LSTM) is a form of RNN which is easier to train than standard RNNs because of its linear error propagation and multiplicative gatings. Our model builds directly on top of the LSTM sentence model and is called the “VIS+LSTM” model. It treats the image as one word of the question. We borrowed this idea of treating the image as a word from caption generation work done by Vinyals et al. . We compare this newly proposed model with a suite of simpler models in the Experimental Results section.

We use the last hidden layer of the 19-layer Oxford VGG Conv Net trained on ImageNet 2014 Challenge as our visual embeddings. The CNN part of our model is kept frozen during training.

We experimented with several different word embedding models: randomly initialized embedding, dataset-specific skip-gram embedding and general-purpose skip-gram embedding model . The word embeddings are trained with the rest of the model.

We then treat the image as if it is the first word of the sentence. Similar to DeViSE , we use a linear or affine transformation to map 4096 dimension image feature vectors to a 300 or 500 dimensional vector that matches the dimension of the word embeddings.

We can optionally treat the image as the last word of the question as well through a different weight matrix and optionally add a reverse LSTM, which gets the same content but operates in a backward sequential fashion.

The LSTM(s) outputs are fed into a softmax layer at the last timestep to generate answers.

2 Question-Answer Generation

The currently available DAQUAR dataset contains approximately 1500 images and 7000 questions on 37 common object classes, which might be not enough for training large complex models. Another problem with the current dataset is that simply guessing the modes can yield very good accuracy.

We aim to create another dataset, to produce a much larger number of QA pairs and a more even distribution of answers. While collecting human generated QA pairs is one possible approach, and another is to synthesize questions based on image labeling, we instead propose to automatically convert descriptions into QA form. In general, objects mentioned in image descriptions are easier to detect than the ones in DAQUAR’s human generated questions, and than the ones in synthetic QAs based on ground truth labeling. This allows the model to rely more on rough image understanding without any logical reasoning. Lastly the conversion process preserves the language variability in the original description, and results in more human-like questions than questions generated from image labeling.

As a starting point we used the MS-COCO dataset , but the same method can be applied to any other image description dataset, such as Flickr , SBU , or even the internet.

We used the Stanford parser to obtain the syntatic structure of the original image description. We also utilized these strategies for forming the questions.

Compound sentences to simple sentences Here we only consider a simple case, where two sentences are joined together with a conjunctive word. We split the orginial sentences into two independent sentences.

Indefinite determiners “a(n)” to definite determiners “the”.

Wh-movement constraints In English, questions tend to start with interrogative words such as “what”. The algorithm needs to move the verb as well as the “wh-” constituent to the front of the sentence. For example: “A man is riding a horse” becomes “What is the man riding?” In this work we consider the following two simple constraints: (1) A-over-A principle which restricts the movement of a wh-word inside a noun phrase (NP) ; (2) Our algorithm does not move any wh-word that is contained in a clause constituent.

2.2 Question Generation

Question generation is still an open-ended topic. Overall, we adopt a conservative approach to generating questions in an attempt to create high-quality questions. We consider generating four types of questions below:

Object Questions: First, we consider asking about an object using “what”. This involves replacing the actual object with a “what” in the sentence, and then transforming the sentence structure so that the “what” appears in the front of the sentence. The entire algorithm has the following stages: (1) Split long sentences into simple sentences; (2) Change indefinite determiners to definite determiners; (3) Traverse the sentence and identify potential answers and replace with “what”. During the traversal of object-type question generation, we currently ignore all the prepositional phrase (PP) constituents; (4) Perform wh-movement. In order to identify a possible answer word, we used WordNet and the NLTK software package to get noun categories.

Number Questions: We follow a similar procedure as the previous algorithm, except for a different way to identify potential answers: we extract numbers from original sentences. Splitting compound sentences, changing determiners, and wh-movement parts remain the same.

Color Questions: Color questions are much easier to generate. This only requires locating the color adjective and the noun to which the adjective attaches. Then it simply forms a sentence “What is the color of the [object]” with the “object” replaced by the actual noun.

Location Questions: These are similar to generating object questions, except that now the answer traversal will only search within PP constituents that start with the preposition “in”. We also added rules to filter out clothing so that the answers will mostly be places, scenes, or large objects that contain smaller objects.

2.3 Post-Processing

We rejected the answers that appear too rarely or too often in our generated dataset. After this QA rejection process, the frequency of the most common answer words was reduced from 24.98% down to 7.30% in the test set of COCO-QA.

Experimental Results

Table 1 summarizes the statistics of COCO-QA. It should be noted that since we applied the QA pair rejection process, mode-guessing performs very poorly on COCO-QA. However, COCO-QA questions are actually easier to answer than DAQUAR from a human point of view. This encourages the model to exploit salient object relations instead of exhaustively searching all possible relations. COCO-QA dataset can be downloaded at http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa

Here we provide some brief statistics of the new dataset. The maximum question length is 55, and average is 9.65. The most common answers are “two” (3116, 2.65%), “white” (2851, 2.42%), and “red” (2443, 2.08%). The least common are “eagle” (25, 0.02%) “tram” (25, 0.02%), and “sofa” (25, 0.02%). The median answer is “bed” (867, 0.737%). Across the entire test set (38,948 QAs), 9072 (23.29%) overlap in training questions, and 7284 (18.70%) overlap in training question-answer pairs.

2 Model Details

VIS+LSTM: The first model is the CNN and LSTM with a dimensionality-reduction weight matrix in the middle; we call this “VIS+LSTM” in our tables and figures.

2-VIS+BLSTM: The second model has two image feature inputs, at the start and the end of the sentence, with different learned linear transformations, and also has LSTMs going in both the forward and backward directions. Both LSTMs output to the softmax layer at the last timestep. We call the second model “2-VIS+BLSTM”.

IMG+BOW: This simple model performs multinomial logistic regression based on the image features without dimensionality reduction (4096 dimension), and a bag-of-word (BOW) vector obtained by summing all the learned word vectors of the question.

FULL: Lastly, the “FULL” model is a simple average of the three models above.

We release the complete details of the models at https://github.com/renmengye/imageqa-public.

3 Baselines

To evaluate the effectiveness of our models, we designed a few baselines.

GUESS: One very simple baseline is to predict the mode based on the question type. For example, if the question contains “how many” then the model will output “two.” In DAQUAR, the modes are “table”, “two”, and “white” and in COCO-QA, the modes are “cat”, “two”, “white”, and “room”.

BOW: We designed a set of “blind” models which are given only the questions without the images. One of the simplest blind models performs logistic regression on the BOW vector to classify answers.

LSTM: Another “blind” model we experimented with simply inputs the question words into the LSTM alone.

IMG: We also trained a counterpart “deaf” model. For each type of question, we train a separate CNN classification layer (with all lower layers frozen during training). Note that this model knows the type of question, in order to make its performance somewhat comparable to models that can take into account the words to narrow down the answer space. However the model does not know anything about the question except the type.

IMG+PRIOR: This baseline combines the prior knowledge of an object and the image understanding from the “deaf model”. For example, a question asking the color of a white bird flying in the blue sky may output white rather than blue simply because the prior probability of the bird being blue is lower. We denote $c$ as the color, $o$ as the class of the object of interest, and $x$ as the image. Assuming $o$ and $x$ are conditionally independent given the color,

This can be computed if $p(c|x)$ is the output of a logistic regression given the CNN features alone, and we simply estimate $p(o|c)$ empirically: $\hat{p}(o|c)=\frac{count(o,c)}{count(c)}$ . We use Laplace smoothing on this empirical distribution.

K-NN: In the task of image caption generation, Devlin et al. showed that a nearest neighbors baseline approach actually performs very well. To see whether our model memorizes the training data for answering new question, we include a K-NN baseline in the results. Unlike image caption generation, here the similarity measure includes both image and text. We use the bag-of-words representation learned from IMG+BOW, and append it to the CNN image features. We use Euclidean distance as the similarity metric; it is possible to improve the nearest neighbor result by learning a similarity metric.

4 Performance Metrics

To evaluate model performance, we used the plain answer accuracy as well as the Wu-Palmer similarity (WUPS) measure . The WUPS calculates the similarity between two words based on their longest common subsequence in the taxonomy tree. If the similarity between two words is less than a threshold then a score of zero will be given to the candidate answer. Following Malinowski and Fritz , we measure all models in terms of accuracy, WUPS 0.9, and WUPS 0.0.

5 Results and Analysis

Table 2 summarizes the learning results on DAQUAR and COCO-QA. For DAQUAR we compare our results with and . It should be noted that our DAQUAR results are for the portion of the dataset (98.3%) with single-word answers. After the release of our paper, Ma et al. claimed to achieve better results on both datasets.

From the above results we observe that our model outperforms the baselines and the existing approach in terms of answer accuracy and WUPS. Our VIS+LSTM and Malinkowski et al.’s recurrent neural network model achieved somewhat similar performance on DAQUAR. A simple average of all three models further boosts the performance by 1-2%, outperforming other models.

It is surprising to see that the IMG+BOW model is very strong on both datasets. One limitation of our VIS+LSTM model is that we are not able to consume image features as large as 4096 dimensions at one time step, so the dimensionality reduction may lose some useful information. We tried to give IMG+BOW a 500 dim. image vector, and it does worse than VIS+LSTM ( $\approx$ 48%).

By comparing the blind versions of the BOW and LSTM models, we hypothesize that in Image QA tasks, and in particular on the simple questions studied here, sequential word interaction may not be as important as in other natural language tasks.

It is also interesting that the blind model does not lose much on the DAQUAR dataset, We speculate that it is likely that the ImageNet images are very different from the indoor scene images, which are mostly composed of furniture. However, the non-blind models outperform the blind models by a large margin on COCO-QA. There are three possible reasons: (1) the objects in MS-COCO resemble the ones in ImageNet more; (2) MS-COCO images have fewer objects whereas the indoor scenes have considerable clutter; and (3) COCO-QA has more data to train complex models.

There are many interesting examples but due to space limitations we can only show a few in Figure 1 and Figure 3; full results are available at http://www.cs.toronto.edu/~mren/imageqa/results. For some of the images, we added some extra questions (the ones have an “a” in the question ID); these provide more insight into a model’s representation of the image and question information, and help elucidate questions that our models may accidentally get correct. The parentheses in the figures represent the confidence score given by the softmax layer of the respective model.

Model Selection: We did not find that using different word embedding has a significant impact on the final classification results. We observed that fine-tuning the word embedding results in better performance and normalizing the CNN hidden image features into zero-mean and unit-variance helps achieve faster training time. The bidirectional LSTM model can further boost the result by a little.

Object Questions: As the original CNN was trained for the ImageNet challenge, the IMG+BOW benefited significantly from its single object recognition ability. However, the challenging part is to consider spatial relations between multiple objects and to focus on details of the image. Our models only did a moderately acceptable job on this; see for instance the first picture of Figure 1 and the fourth picture of Figure 3. Sometimes a model fails to make a correct decision but outputs the most salient object, while sometimes the blind model can equally guess the most probable objects based on the question alone (e.g., chairs should be around the dining table). Nonetheless, the FULL model improves accuracy by 50% compared to IMG model, which shows the difference between pure object classification and image question answering.

Counting: In DAQUAR, we could not observe any advantage in the counting ability of the IMG+BOW and the VIS+LSTM model compared to the blind baselines. In COCO-QA there is some observable counting ability in very clean images with a single object type. The models can sometimes count up to five or six. However, as shown in the second picture of Figure 3, the ability is fairly weak as they do not count correctly when different object types are present. There is a lot of room for improvement in the counting task, and in fact this could be a separate computer vision problem on its own.

Color: In COCO-QA there is a significant win for the IMG+BOW and the VIS+LSTM against the blind ones on color-type questions. We further discovered that these models are not only able to recognize the dominant color of the image but sometimes associate different colors to different objects, as shown in the first picture of Figure 3. However, they still fail on a number of easy examples. Adding prior knowledge provides an immediate gain on the IMG model in terms of accuracy on Color and Number questions. The gap between the IMG+PRIOR and IMG+BOW shows some localized color association ability in the CNN image representation.

Conclusion and Current Directions

In this paper, we consider the image QA problem and present our end-to-end neural network models. Our model shows a reasonable understanding of the question and some coarse image understanding, but it is still very naïve in many situations. While recurrent networks are becoming a popular choice for learning image and text, we showed that a simple bag-of-words can perform equally well compared to a recurrent network that is borrowed from an image caption generation framework . We proposed a more complete set of baselines which can provide potential insight for developing more sophisticated end-to-end image question answering systems. As the currently available dataset is not large enough, we developed an algorithm that helps us collect large scale image QA dataset from image descriptions. Our question generation algorithm is extensible to many image description datasets and can be automated without requiring extensive human effort. We hope that the release of the new dataset will encourage more data-driven approaches to this problem in the future.

Image question answering is a fairly new research topic, and the approach we present here has a number of limitations. First, our models are just answer classifiers. Ideally we would like to permit longer answers which will involve some sophisticated text generation model or structured output. But this will require an automatic free-form answer evaluation metric. Second, we are only focusing on a limited domain of questions. However, this limited range of questions allow us to study the results more in depth. Lastly, it is also hard to interpret why the models output a certain answer. By comparing our models with some baselines we can roughly infer whether they understood the image. Visual attention is another future direction, which could both improve the results (based on recent successes in image captioning ) as well as help explain the model prediction by examining the attention output at every timestep.

We would like to thank Nitish Srivastava for the support of Toronto Conv Net, from which we extracted the CNN image features. We would also like to thank anonymous reviewers for their valuable and helpful comments.

References

Appendix A Supplementary Material

A.2 Post-Processing of COCO-QA Detail

First, answers that appear less than a frequency threshold are discarded. Second, we enroll a QA pair one at a time. The probability of enrolling the next QA pair $(q,a)$ is:

where $count(a)$ denotes the current number of enrolled QA pairs that have $a$ as the ground truth answer, and $K$ , $K_{2}$ are some constants with $K\leq K_{2}$ . In the COCO-QA generation we chose $K=100$ and $K_{2}=200$ .