MarioQA: Answering Questions by Watching Gameplay Videos

Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han

Introduction

While deep convolutional neural networks trained on large-scale datasets have made significant progress on various visual recognition problems, most of these tasks focus on recognition at the same or similar levels, e.g., objects, scenes, actions, attributes, face identities, etc. On the other hand, image question answering (ImageQA) addresses a holistic image understanding problem and handles diverse recognition tasks in a single framework. The main objective of ImageQA is to find an answer relevant to a pair of an input image and a question by capturing information at various semantic levels. This problem is often formulated with deep neural networks, and has been successfully investigated thanks to advances in representation learning techniques and the release of outstanding pretrained deep neural network models.

Video question answering (VideoQA) is a more challenging task and has recently been introduced as a natural extension of ImageQA. In VideoQA, it is possible to ask a wide range of questions about temporal relationships between events, such as dynamics, sequence, and causality. However, there is only limited understanding of how effective VideoQA models are in capturing various information from videos, partly because there is no proper framework, including a dataset, for analyzing models. In other words, it is not straightforward to identify the main reason for failure in VideoQA problems: the dataset or the trained model.

There exist a few independent datasets for VideoQA, but they are not organized well enough to estimate the capacity and flexibility of models accurately. For example, answering questions in the MovieQA dataset requires a very high-level understanding of movie contents, which is almost impossible to extract from visual cues only (e.g., calling off one’s tour, corrupt business, and vulnerable people) and consequently needs external information or additional modalities. On the other hand, questions in other datasets often rely on static or time-invariant information, which allows answers to be found by observing a single frame with no consideration of temporal dependency.

To facilitate understanding of VideoQA models, we introduce a novel analysis framework, where we generate a customizable dataset using Super Mario gameplay videos, referred to as MarioQA. Our dataset is automatically generated from gameplay videos and question templates to contain desired properties for analysis. We employ the proposed framework to analyze the impact of a properly constructed dataset on answering questions, where we are particularly interested in questions related to temporal reasoning about events. The generated dataset consists of three subsets, each of which contains questions with a different level of difficulty in temporal reasoning. Note that, by controlling the complexity of questions, individual models can be trained and evaluated on different subsets. Due to its synthetic nature, we can eliminate ambiguity in answers, which is often problematic in existing datasets, and make evaluation more reliable.

Our contribution is three-fold as summarized below:

We propose a novel framework for analyzing VideoQA models, where a customized dataset is automatically generated to have desired properties for target analysis.

We generate a synthetic VideoQA dataset, referred to as MarioQA, using Super Mario gameplay videos with their logs and a set of predefined templates to understand temporal reasoning capability of models.

We present the benefit of our framework to facilitate analysis of algorithms and show that a properly generated dataset is critical to performance improvement.

The rest of the paper is organized as follows. We first review related work in Section 2. Sections 3 and 4 present our analysis framework with a new dataset and discuss several baseline neural models, respectively. We analyze the models in Section 5 and conclude the paper in Section 6.

Related Work

The proposed method provides a framework for VideoQA model analysis. Likewise, there have been several attempts to build synthetic testbeds for analyzing QA models. For example, bAbI constructs a testbed for textual QA analysis with multiple synthetic subtasks, each of which focuses on a single aspect of the textual QA problem. The datasets are generated by simulating a virtual world given a set of actions by virtual actors and a set of constraints imposed on the actors. For ImageQA, the CLEVR dataset was recently released to understand the visual reasoning capability of ImageQA models. It aims to analyze how well ImageQA models generalize to the compositionality of language. Images in CLEVR are synthetically generated by randomly sampling and rendering multiple predetermined objects and their relationships.

VideoQA Datasets

Zhu et al. constructed a VideoQA dataset in three domains using existing videos with grounded descriptions, where the domains include cooking scenarios, movie clips and web videos. The fill-in-the-blank questions are automatically generated by omitting a phrase (a verb or noun phrase) from the grounded description, and the omitted phrase becomes the answer. For evaluation, the task is formed as answering multiple choice questions with four answer candidates. Similarly, another VideoQA dataset has been introduced with fill-in-the-blank questions automatically generated from the LSMDC movie description dataset. Although the task for these datasets has a clear evaluation metric, the evaluation is still based on exact word matching rather than on matching semantics. Moreover, the questions can be answered by simply observing any single frame without the need for temporal reasoning.

The MovieQA dataset is another public benchmark for VideoQA based on movie clips. This dataset contains additional information in other modalities including plot synopses, subtitles, DVS and scripts. The question and answer (QA) pairs are manually annotated based on the plot synopses without watching the movies. The tasks in the MovieQA dataset are difficult because most questions are about the story of the movies rather than about the visual contents of the video clips. Hence, answering them requires external information other than the video clips, and the dataset is not appropriate for evaluating trained models in terms of video understanding capability.

In contrast to these datasets, MarioQA is composed of videos with multiple events and event-centric questions. Data in MarioQA require video understanding over multiple frames to reason about temporal relationships between events, but do not need extra information to find answers.

Image and Video Question Answering

Because ImageQA needs to handle two input modalities, i.e., image and question, a method of fusing the two modalities has been presented to obtain rich multi-modal representations. To handle the diversity of target tasks in ImageQA, adaptive architecture design and parameter setting techniques have been proposed. Since questions often refer to particular objects within input images, many networks are designed to attend to relevant regions only. However, it is not straightforward to extend the attention models learned in ImageQA to VideoQA tasks due to the additional temporal dimension in videos.

In MovieQA, questions depend heavily on the textual information provided with movie clips, so models focus on embedding the multi-modal inputs in a common space. Video features are obtained by simply average-pooling image features of multiple frames. On the other hand, gated recurrent units (GRU) have also been employed for sequential modeling of videos instead of simple pooling methods. In addition, unsupervised feature learning is performed to improve video representation power. However, none of these methods explores attention models, although visual attention turns out to be effective in ImageQA.

VideoQA Analysis Framework

This section describes our VideoQA analysis framework for temporal reasoning capability using the MarioQA dataset.

MarioQA is a new VideoQA dataset in which videos are recorded from gameplays and questions are about the events occurring in the videos. We use the Infinite Mario Bros. game (https://github.com/cflewis/Infinite-Mario-Bros), which is a variant of Super Mario Bros. with endless random level generation, to collect video clips with event logs and generate QA pairs automatically from extracted events using manually constructed templates. Our dataset mainly contains questions about temporal relationships of multiple events to analyze the temporal reasoning capability of models. Each example consists of a $240\times 320$ video clip containing multiple events and a question with its corresponding answer. Figure 1 illustrates our data collection procedure.

We build the dataset based on two design principles to overcome the existing limitations. First, the dataset aims to verify a model’s capability for temporal reasoning about events in videos. To focus on this main issue, we remove questions that require additional or external information to return correct answers, and thus highlight model capacity for video understanding. Second, given a question, the answer should be clear and unique to ensure meaningful evaluation and interpretation. Uncertainty in answers, ambiguous linguistic structure of questions, and multiple correct answers may result in inconsistent or even wrong analysis, and may cause algorithms to get stuck in local optima.

We choose gameplay videos as our video domain for the following reasons. First, we can easily obtain a large number of videos that contain multiple events with temporal dependencies. Second, learning complete semantics in gameplay videos is relatively easy compared to other domains due to the simplicity of their representation. Third, the occurrence of an event is clear and there is no perceptual ambiguity. In real videos, answers to a question may differ across annotators because of subjective perception of visual information. In contrast, we can simply access the oracle within the code to find answers.

We extract 11 distinct events $E=\{\text{kill, die, jump, hit, break, appear, shoot, throw, kick, hold, eat}\}$ with their arguments, e.g., agent, patient and instrument, from gameplays. For each extracted event as a target, we randomly sample video clips containing the target event with a duration of 3 to 6 seconds. We then check the uniqueness of the target event within the sampled clip. For instance, the event kill with its arguments PGoomba and stomping is a unique target event among 8 extracted events in the video clip of Figure 1. This uniqueness check rejects questions involving multiple answers since they cause ambiguity in evaluation.
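The uniqueness check can be illustrated with a minimal sketch. This is not the authors' implementation; the `Event` record, its field names, the timestamps and the helper `is_unique_target` are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical event record: event type, its arguments, and a timestamp in seconds.
Event = namedtuple("Event", ["name", "args", "time"])


def is_unique_target(target, clip_events):
    """Return True if exactly one event in the clip has the target's type
    and arguments, so that a question about it has a single answer."""
    matches = [e for e in clip_events
               if e.name == target.name and e.args == target.args]
    return len(matches) == 1


# Toy clip with made-up timestamps.
clip = [
    Event("jump", (), 0.4),
    Event("kill", ("PGoomba", "stomping"), 1.2),
    Event("throw", ("shell",), 2.0),
    Event("jump", (), 2.5),
]

print(is_unique_target(clip[1], clip))  # True: this kill event occurs once
print(is_unique_target(clip[0], clip))  # False: jump occurs twice
```

A clip whose target fails this check would be rejected and another clip sampled.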

Once the video clips of unique events are extracted, we generate QA pairs from the extracted events. We randomly eliminate one of the event arguments to form a question semantic chunk and generate a question from the question semantic chunk using predefined question templates. For example, the argument PGoomba is removed from the event kill(PGoomba, stomping) in Figure 1 to form the question semantic chunk kill(?, stomping). Then, a question template ‘What enemy did Mario kill arg1 ?’ is selected from the template pool for the question semantic chunk. Finally, a question is generated by filling the template with the phrase ‘by stomping’, which linguistically realizes the argument stomping. We use template-based question generation because it allows us to control the level of semantics required for question answering. When annotators are told to freely create questions, it is hard to control the required level of semantics. Instead, we ask human annotators to create multiple linguistic realizations of a question semantic chunk. After question generation, the corresponding answer is generated from the eliminated event argument, i.e., Para Goomba in the above example. Note that the dataset is easily customized by updating the QA templates to reflect further analytical perspectives and demands.
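The template-filling procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the `{arg1}` placeholder syntax, the second template, the phrase and answer tables, and the function `generate_qa` are all hypothetical; only the first template and the ‘by stomping’ realization come from the example in the text.

```python
import random

# Hypothetical template pool keyed by (event name, index of the removed argument).
TEMPLATES = {
    ("kill", 0): ["What enemy did Mario kill {arg1}?"],
    ("kill", 1): ["How did Mario kill {arg0}?"],  # invented for illustration
}

# Hypothetical linguistic realizations of arguments inside a question,
# and the answer strings produced when an argument is the removed one.
PHRASES = {"PGoomba": "a Para Goomba", "stomping": "by stomping"}
ANSWERS = {"PGoomba": "Para Goomba", "stomping": "Stomping"}


def generate_qa(name, args, removed=None, rng=random):
    """Remove one argument, pick a template for the resulting semantic
    chunk, and fill it with phrases realizing the remaining arguments."""
    if removed is None:
        removed = rng.randrange(len(args))
    template = rng.choice(TEMPLATES[(name, removed)])
    fillers = {f"arg{i}": PHRASES[a]
               for i, a in enumerate(args) if i != removed}
    return template.format(**fillers), ANSWERS[args[removed]]


question, answer = generate_qa("kill", ("PGoomba", "stomping"), removed=0)
print(question)  # What enemy did Mario kill by stomping?
print(answer)    # Para Goomba
```

Keeping several templates per semantic chunk in the pool would correspond to the multiple linguistic realizations written by annotators.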

Characteristics of Dataset for Model Analysis

There are three question types in MarioQA to maintain diversity. The following are examples of event-centric, counting and state questions, respectively: ‘What did Mario hit before killing Goomba?’, ‘How many coins did Mario eat after a Red Koopa Paratroopa appears?’ and ‘What was Mario’s state when Green Koopa Paratroopa appeared?’ While questions of the three types generally require observation over multiple frames to find answers, a majority of state questions need only a single-frame observation of objects and/or the scene due to the uniqueness of the state within a clip.

As seen in the above examples, multiple events in a single video may be temporally related to each other, and understanding such temporal dependency is an important aspect of VideoQA. Despite its importance, temporal dependency in videos has not been explored explicitly due to the lack of proper datasets and the complexity of the task. Thanks to the synthetic nature of our data, we can conveniently generate questions about temporal relationships.

We construct the MarioQA dataset with three subsets, which contain questions with different characteristics in temporal relationships: questions with no temporal relationship (NT), with easy temporal relationships (ET) and with hard temporal relationships (HT). NT asks questions about unique events in the entire video without any temporal relationship phrase. ET and HT have questions with temporal relationships at different levels of difficulty. While ET contains questions about globally unique events, HT involves distracting events, which force a VideoQA system to choose the right answer out of multiple identical events using temporal reasoning; for a target event kill(PGoomba, stomping), any kill(*, *) events in the same video clip are considered distracting events. For example, the answer to the question ‘How many times did Mario jump after throwing a shell?’ about the video clip in Figure 1 is not 3 but 2 due to its temporal constraint. The generated questions are still categorized into one of the three types: event-centric, counting and state questions.
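The effect of a temporal constraint on a counting question, and the notion of a distracting event, can be sketched as below. This is only an illustration: events are simplified to hypothetical `(name, time)` pairs with made-up timestamps, arguments are omitted, and the helpers `count_after` and `has_distractors` are not taken from the authors' code.

```python
def count_after(events, counted, reference):
    """Count `counted` events that occur after the `reference` event,
    which must occur exactly once in the clip."""
    ref_times = [t for name, t in events if name == reference]
    if len(ref_times) != 1:
        raise ValueError("reference event must occur exactly once")
    return sum(1 for name, t in events
               if name == counted and t > ref_times[0])


def has_distractors(events, target):
    """True if more than one event of the target's type is in the clip,
    i.e., the other ones act as distracting events."""
    return sum(1 for name, _ in events if name == target) > 1


# Toy clip: three jumps in total, two of them after the throw.
clip = [("jump", 0.5), ("throw", 1.0), ("jump", 1.8),
        ("kill", 2.2), ("jump", 3.1)]

print(sum(1 for name, _ in clip if name == "jump"))  # 3
print(count_after(clip, "jump", "throw"))            # 2
print(has_distractors(clip, "jump"))                 # True
print(has_distractors(clip, "kill"))                 # False
```

Ignoring the temporal phrase yields 3, while respecting it yields 2, which mirrors the example in the text.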

Dataset Statistics

From a total of 13 hours of gameplay, we collect 187,757 examples with automatically generated QA pairs. There are 92,874 unique QA pairs, and each video clip contains 11.3 events on average. There are 78,297, 64,619 and 44,841 examples in NT, ET and HT, respectively. Note that there are 3.5K examples that can be answered using a single frame of video; the portion of such examples is less than 2%. The other examples are event-centric; 98K examples require focusing on a single event out of multiple ones, while 86K require recognizing multiple events for counting (55K) or identifying their temporal relationships (44K). Note that some instances belong to both cases.
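As a quick consistency check of the figures above, the subset sizes can be summed and the single-frame portion computed. The value 3,500 is an approximation of the rounded "3.5K" in the text, so the resulting percentage is approximate as well.

```python
# Subset sizes as reported in the text.
subset_sizes = {"NT": 78_297, "ET": 64_619, "HT": 44_841}

total = sum(subset_sizes.values())
print(total)  # 187757, matching the reported number of examples

# "3.5K" is a rounded figure, so this ratio is only approximate.
single_frame = 3_500
print(round(100 * single_frame / total, 2))  # 1.86 (percent), i.e., below 2%
```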

Some types of events are observed more frequently than others due to the characteristics of the game, which is also common in real datasets. To make our dataset more balanced, we limit the maximum number of identical QA pairs. The QA pair distribution of each subset is depicted in Figure 2. The innermost circles show the distributions of the three question types. The portion of event-centric questions is much larger than those of the other types in all three subsets, as we focus on event-centric questions. The middle circles present instance distributions in each question type, where we observe a large portion of kill events since kill events occur with more diverse arguments such as multiple kinds of enemies and weapons. The outermost circles show the answer distributions related to individual events or states.
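Limiting the number of identical QA pairs can be sketched as a simple cap applied while scanning the examples. The cap value, the dictionary layout of an example, and the function `cap_duplicates` are assumptions for illustration; the text does not state the actual limit or procedure.

```python
from collections import Counter


def cap_duplicates(examples, max_per_pair):
    """Keep at most `max_per_pair` examples for each (question, answer) pair,
    preserving the original order of the kept examples."""
    seen = Counter()
    kept = []
    for ex in examples:
        key = (ex["question"], ex["answer"])
        if seen[key] < max_per_pair:
            seen[key] += 1
            kept.append(ex)
    return kept


# Toy data: the same QA pair appears three times with different clips.
examples = [
    {"clip": 1, "question": "What did Mario eat?", "answer": "coin"},
    {"clip": 2, "question": "What did Mario eat?", "answer": "coin"},
    {"clip": 3, "question": "What did Mario eat?", "answer": "coin"},
    {"clip": 4, "question": "What did Mario eat?", "answer": "mushroom"},
]

balanced = cap_duplicates(examples, max_per_pair=2)
print([ex["clip"] for ex in balanced])  # [1, 2, 4]
```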

The characteristics of VideoQA datasets are presented in Table 1. The fill-in-the-blank dataset of Zhu et al. has more examples than the other datasets, but it is fragmented into many subsets that are hard to use as a whole. The MovieQA dataset has extra information in other modalities. Both datasets have limitations in evaluating model capacity for video understanding, as shown by the examples in Table 2. The questions in the fill-in-the-blank dataset are mainly about the salient contents throughout the videos. These questions can be answered by understanding a single frame rather than multiple ones. On the other hand, the questions in MovieQA are often too difficult to answer by watching videos, as they require very high-level abstraction of the movie story. In contrast, MarioQA contains videos with multiple events and their temporal relationships. The event-centric questions with temporal dependency allow us to evaluate whether a model can reason about temporal relationships between multiple events.

Neural Models for MarioQA

We describe our neural baseline models for MarioQA. All networks comprise three components: the question embedding, video embedding and classification networks depicted in Figure 3. We describe each component in detail below.

Video Embedding Network

where $T=K/4$ and $V$ denotes an input video. Once the video features are extracted from the 3DFCN, we embed this feature volume $\bm{f}^{v}$ into a low-dimensional space in one of the following ways.

Embedding with Spatio-Temporal Attention

We can attend to a single feature in a spatio-temporal feature volume. The attention score $s_{t,i,j}$ for each feature $f^{v}_{t,i,j}$ and the attention probability $\alpha_{t,i,j}$ are given respectively by

Global Context Embedding

Classification Network

where $W_{q}$ is a $512\times 2400$ weight matrix and $\sigma$ is a nonlinear function such as ReLU. Then, we fuse the two embedded vectors and generate the final classification score by

Experiments

We have three subsets in the MarioQA dataset as presented in Table 4. We aim to analyze the impact of questions with temporal relationships in training, so we train models on the following three combinations of the subsets: NT (case 1), NT+ET (case 2) and NT+ET+HT (case 3). Then, these models are evaluated on the test split of each subset to verify their temporal reasoning capability. We implement two versions of the temporal attention model with one and two attention steps (1-T and 2-T), following prior work. The spatio-temporal attention model (ST) and the global context embedding (GC) are also implemented. In addition to these models, we build three simple baselines:

Video Only (V) Given a video, the model predicts an answer without knowing the question. We perform video embedding by Eq. (12) and predict answers using a multi-layer perceptron with one hidden layer.

Question Only (Q) This model predicts an answer by observing questions but without seeing videos. The same question embedding network is used together with the classification network.

Average Pooling (AP) This model embeds the video feature volume $\bm{f}^{v}$ by average pooling over the spatio-temporal space and uses the result for final classification. This model serves as a comparison with the attention models, since average pooling is equivalent to assigning uniform attention to every spatio-temporal location.
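The equivalence between average pooling and uniform attention can be checked numerically with a small sketch. It assumes that attention probabilities are obtained by a softmax over scores and that the attended feature is the probability-weighted sum of the features; the flattened list of feature vectors and the helper names are hypothetical and do not reproduce the models' actual layers.

```python
import math


def attend(features, scores):
    """Softmax over scores followed by a weighted sum of feature vectors."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]


def average_pool(features):
    """Plain mean of the feature vectors over all locations."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


# Toy feature volume flattened to four spatio-temporal locations.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Equal scores give uniform attention probabilities of 1/4 each.
uniform = attend(features, [0.0] * len(features))
pooled = average_pool(features)

print(all(math.isclose(u, p) for u, p in zip(uniform, pooled)))  # True
```

Any non-constant score vector shifts weight toward some locations, which is what distinguishes the attention models from AP.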

All the models are trained end-to-end from scratch by standard backpropagation, except that the question embedding network is initialized with a pretrained model. The vocabulary sizes of questions are 136, 168 and 168 for NT, ET and HT, respectively, and the number of answer classes is 57.

In our scenario, the initial model of each algorithm is trained with NT only, and we evaluate the algorithms on all three subsets by simply computing the ratio of correct answers to the total number of questions. Then, we add ET and HT to the training data one by one, perform the same evaluation, and observe the trend in performance.
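The evaluation protocol amounts to computing accuracy for every pair of training case and test subset. The sketch below illustrates this under assumptions: the `evaluate` helper, the representation of a trained model as a callable, and the toy data are hypothetical stand-ins for the actual networks and test splits.

```python
def accuracy(predictions, answers):
    """Ratio of correct answers to the total number of questions."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)


def evaluate(models, test_sets):
    """Score each trained model (one per training case) on each test subset.

    `models` maps a case name to a callable that returns one predicted
    answer per input; `test_sets` maps a subset name to (inputs, answers).
    """
    results = {}
    for case, predict in models.items():
        for subset, (inputs, answers) in test_sets.items():
            results[(case, subset)] = accuracy(predict(inputs), answers)
    return results


# Toy stand-ins: a "model" that always answers "coin".
toy_tests = {"NT": (["q1", "q2"], ["coin", "Goomba"]), "HT": (["q3"], ["2"])}
toy_models = {"case 1": lambda inputs: ["coin"] * len(inputs)}

print(evaluate(toy_models, toy_tests))
# {('case 1', 'NT'): 0.5, ('case 1', 'HT'): 0.0}
```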

Results

Table 3 presents the overall results of our experiments. As expected, the two simple baselines (V and Q) show significantly lower performance than the others. (To demonstrate the strength of the trained models, we evaluate random guess accuracies, which are 16.96, 12.66 and 12.66 (%) for NT, ET and HT, respectively. The accuracies obtained by selecting the most frequent answer are 37.85, 32.29 and 39.00 (%) for NT, ET and HT, respectively.) Although the three attention-based models and AP outperform GC in case 1, GC becomes very competitive in case 2 and case 3. This is probably because network architectures with general capabilities such as fully-connected layers are more powerful than linear combinations of attentive features; if the dataset is properly constructed to involve examples with temporal relationships, GC is likely to achieve high performance. However, attention models may be able to gain more benefit from pretrained models, and GC is a preferable model for our environment with short video clips.
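A most-frequent-answer reference score of the kind quoted above can be computed as in the following sketch. It assumes that the majority answer is chosen from training answers and scored on test answers; the text does not specify the exact protocol (for example, whether the choice is made per question type), so this is only one plausible reading, and the toy answers are made up.

```python
from collections import Counter


def most_frequent_answer_accuracy(train_answers, test_answers):
    """Always predict the most common training answer and score it on the test set."""
    majority, _ = Counter(train_answers).most_common(1)[0]
    hits = sum(1 for a in test_answers if a == majority)
    return majority, hits / len(test_answers)


train_answers = ["Goomba", "coin", "Goomba", "2", "Goomba"]
test_answers = ["Goomba", "coin", "Goomba", "mushroom"]

print(most_frequent_answer_accuracy(train_answers, test_answers))
# ('Goomba', 0.5)
```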

Our results strongly suggest that proper training data construction helps to learn a better model. Figure 4 presents qualitative results for an HT question in all three cases. It shows that the models tend to predict the correct answer more often as ET and HT are added to the training dataset. The quantitative impact of adding ET and HT to the training data is illustrated in Figure 5. By adding ET to the training dataset, we observe improvement of all attention models (and AP) on all three subsets, where the performance gains on ET are consistently the most significant across algorithms. A similar observation holds when HT is additionally included in the training dataset, although the magnitudes of improvement are relatively small. This makes sense because the accuracies become more saturated as more data are used for training.

It is notable that training with the subsets that require more difficult temporal reasoning also improves performance on the subsets with easier temporal reasoning; training with ET or HT improves performance not only on ET and HT but also on NT. It is also interesting that training with ET still improves the accuracy on HT. Since ET does not contain any distracting events, questions in ET can conceptually be answered regardless of temporal relationships. However, the improvement on HT in case 2 suggests that the networks still learn a way of temporally relating events using ET.

One may argue that the improvement mainly comes from the increased number of training examples in case 2 and case 3. To clarify this issue, we train GC models for case 2 and case 3 using roughly the same number of training examples as in case 1 (Table 5). Due to the smaller training datasets, the overall accuracies are not as good as in our previous experiment, but the trend of performance improvement is almost the same. It is interesting that the three cases achieve almost the same accuracy on the NT test set in this experiment, even with fewer NT training examples in case 2 and case 3. This fact shows that ET and HT are helpful for solving questions in NT.
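One way to build training sets of roughly equal size across cases is to subsample each included subset. The sketch below assumes an equal share per subset and a fixed random seed; the text does not state how the reduced training sets were actually drawn, and `balanced_subsample` is a hypothetical helper.

```python
import random


def balanced_subsample(subsets, total, seed=0):
    """Draw about `total` examples overall, with an equal share from each subset.

    `subsets` maps a subset name to a list of examples; each list must
    contain at least `total // len(subsets)` examples.
    """
    rng = random.Random(seed)
    share = total // len(subsets)
    sample = []
    for examples in subsets.values():
        sample.extend(rng.sample(examples, share))
    return sample


# Toy subsets identified by tagged integers.
nt = [("NT", i) for i in range(100)]
et = [("ET", i) for i in range(80)]
ht = [("HT", i) for i in range(60)]

case1 = balanced_subsample({"NT": nt}, total=60)
case3 = balanced_subsample({"NT": nt, "ET": et, "HT": ht}, total=60)

print(len(case1), len(case3))  # 60 60
```

With this scheme, case 3 contains fewer NT examples than case 1, which matches the setting discussed above.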

Conclusion

We propose a new analysis framework for VideoQA and construct a customizable synthetic dataset, MarioQA. Unlike existing datasets, MarioQA focuses on event-centric questions with temporal relationships to evaluate the temporal reasoning capability of algorithms. The questions and answers in MarioQA are automatically generated based on manually constructed question templates. We use our dataset to analyze the impact of questions with temporal relationships in training, and show that a properly collected dataset is critical to improving the quality of VideoQA models. We also believe that MarioQA can be used for further analyses of other aspects of VideoQA by customizing the dataset with preferred characteristics in its generation.

This work was partly supported by the ICT R&D program of MSIP/IITP [2014-0-00059 and 2016-0-00563] and the NRF grant [NRF-2011-0031648] in Korea. Samsung funded our preliminary work in part.