Visual Dialog

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra

Introduction

We are witnessing unprecedented advances in computer vision (CV) and artificial intelligence (AI) – from ‘low-level’ AI tasks such as image classification, scene recognition, object detection – to ‘high-level’ AI tasks such as learning to play Atari video games and Go, answering reading comprehension questions by understanding short stories, and even answering questions about images and videos!

What lies next for AI? We believe that the next generation of visual intelligence systems will need to possess the ability to hold a meaningful dialog with humans in natural language about visual content. Applications include:

Aiding visually impaired users in understanding their surroundings or social media content (AI: ‘John just uploaded a picture from his vacation in Hawaii’, Human: ‘Great, is he at the beach?’, AI: ‘No, on a mountain’).

Aiding analysts in making decisions based on large quantities of surveillance data (Human: ‘Did anyone enter this room last week?’, AI: ‘Yes, 27 instances logged on camera’, Human: ‘Were any of them carrying a black bag?’).

Interacting with an AI assistant (Human: ‘Alexa – can you see the baby in the baby monitor?’, AI: ‘Yes, I can’, Human: ‘Is he sleeping or playing?’).

Robotics applications (e.g. search and rescue missions) where the operator may be ‘situationally blind’ and operating via language (Human: ‘Is there smoke in any room around you?’, AI: ‘Yes, in one room’, Human: ‘Go there and look for people’).

Despite rapid progress at the intersection of vision and language – in particular, in image captioning and visual question answering (VQA) – it is clear that we are far from this grand goal of an AI agent that can ‘see’ and ‘communicate’. In captioning, the human-machine interaction consists of the machine simply talking at the human (‘Two people are in a wheelchair and one is holding a racket’), with no dialog or input from the human. While VQA takes a significant step towards human-machine interaction, it still represents only a single round of a dialog – unlike in human conversations, there is no scope for follow-up questions, no memory in the system of previous questions asked by the user nor consistency with respect to previous answers provided by the system (Q: ‘How many people on wheelchairs?’, A: ‘Two’; Q: ‘How many wheelchairs?’, A: ‘One’).

As a step towards conversational visual AI, we introduce a novel task – Visual Dialog – along with a large-scale dataset, an evaluation protocol, and novel deep models.

Task Definition. The concrete task in Visual Dialog is the following – given an image $I$ , a history of a dialog consisting of a sequence of question-answer pairs (Q1: ‘How many people are in wheelchairs?’, A1: ‘Two’, Q2: ‘What are their genders?’, A2: ‘One male and one female’), and a natural language follow-up question (Q3: ‘Which one is holding a racket?’), the task for the machine is to answer the question in free-form natural language (A3: ‘The woman’). This task is the visual analogue of the Turing Test.

Consider the Visual Dialog examples in Fig. 2. The question ‘What is the gender of the one in the white shirt?’ requires the machine to selectively focus and direct attention to a relevant region. ‘What is she doing?’ requires co-reference resolution (whom does the pronoun ‘she’ refer to?), ‘Is that a man to her right?’ further requires the machine to have visual memory (which object in the image were we talking about?). Such systems also need to be consistent with their outputs – ‘How many people are in wheelchairs?’, ‘Two’, ‘What are their genders?’, ‘One male and one female’ – note that the number of genders being specified should add up to two. Such difficulties make the problem a highly interesting and challenging one.

Why do we talk to machines? Prior work in language-only (non-visual) dialog can be arranged on a spectrum with the following two end-points: goal-driven dialog (e.g. booking a flight for a user) $\longleftrightarrow$ goal-free dialog (or casual ‘chit-chat’ with chatbots). The two ends have vastly differing purposes and conflicting evaluation criteria. Goal-driven dialog is typically evaluated on task-completion rate (how frequently was the user able to book their flight) or time to task completion – clearly, the shorter the dialog the better. In contrast, for chit-chat, the longer the user engagement and interaction, the better. For instance, the goal of the 2017 $2.5 Million Amazon Alexa Prize is to “create a socialbot that converses coherently and engagingly with humans on popular topics for 20 minutes.”

We believe our instantiation of Visual Dialog hits a sweet spot on this spectrum. It is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being grounded enough in vision to allow objective evaluation of individual responses and benchmark progress. The former discourages task-engineered bots for ‘slot filling’ and the latter discourages bots that put on a personality to avoid answering questions while keeping the user engaged.

Contributions. We make the following contributions:

We propose a new AI task: Visual Dialog, where a machine must hold dialog with a human about visual content.

We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). Upon completion, VisDial will contain 1 dialog each (with 10 question-answer pairs) on $\sim$ 140k images from the COCO dataset, for a total of $\sim$ 1.4M dialog question-answer pairs. (VisDial data on COCO-train ( $\sim$ 83k images) and COCO-val ( $\sim$ 40k images) is already available for download at https://visualdialog.org. Since dialog history contains the ground-truth caption, we will not be collecting dialog data on COCO-test. Instead, we will collect dialog data on 20k extra images from COCO distribution (which will be provided to us by the COCO team) for our test set.) When compared to VQA, VisDial studies a significantly richer task (dialog), overcomes a ‘visual priming bias’ in VQA (in VisDial, the questioner does not see the image), contains free-form longer answers, and is an order of magnitude larger.

We introduce a family of neural encoder-decoder models for Visual Dialog with 3 novel encoders:

Late Fusion: that embeds the image, history, and question into vector spaces separately and performs a ‘late fusion’ of these into a joint embedding.

Hierarchical Recurrent Encoder: that contains a dialog-level Recurrent Neural Network (RNN) sitting on top of a question-answer ( $QA$ )-level recurrent block. In each $QA$ -level recurrent block, we also include an attention-over-history mechanism to choose and attend to the round of the history relevant to the current question.

Memory Network: that treats each previous $QA$ pair as a ‘fact’ in its memory bank and learns to ‘poll’ the stored facts and the image to develop a context vector.

We train all these encoders with 2 decoders (generative and discriminative) – all settings outperform a number of sophisticated baselines, including our adaptation of state-of-the-art VQA models to VisDial.

We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a list of candidate answers and evaluated on metrics such as mean-reciprocal-rank of the human response.

We conduct studies to quantify human performance.

Putting it all together, on the project page we demonstrate the first visual chatbot!

Related Work

Vision and Language. A number of problems at the intersection of vision and language have recently gained prominence – image captioning, video/movie description, text-to-image coreference/grounding, visual storytelling, and of course, visual question answering (VQA). However, all of these involve (at most) a single-shot natural language interaction – there is no dialog. Concurrent with our work, two recent works have also begun studying visually-grounded dialog.

Visual Turing Test. Closely related to our work is that of Geman et al., who proposed a fairly restrictive ‘Visual Turing Test’ – a system that asks templated, binary questions. In comparison, 1) our dataset has free-form, open-ended natural language questions collected via two subjects chatting on Amazon Mechanical Turk (AMT), resulting in a more realistic and diverse dataset (see Fig. 5). 2) The dataset of Geman et al. contains only street scenes, while our dataset has considerably more variety since it uses images from COCO. Moreover, our dataset is two orders of magnitude larger – 2,591 images in Geman et al. vs $\sim$ 140k images, 10 question-answer pairs per image, total of $\sim$ 1.4M QA pairs.

Text-based Question Answering. Our work is related to text-based question answering or ‘reading comprehension’ tasks studied in the NLP community. Some recent large-scale datasets in this domain include the 30M Factoid Question-Answer corpus, 100K SimpleQuestions dataset, DeepMind Q&A dataset, the 20 artificial tasks in the bAbI dataset, and the SQuAD dataset for reading comprehension. VisDial can be viewed as a fusion of reading comprehension and VQA. In VisDial, the machine must comprehend the history of the past dialog and then understand the image to answer the question. By design, the answer to any question in VisDial is not present in the past dialog – if it were, the question would not be asked. The history of the dialog contextualizes the question – the question ‘what else is she holding?’ requires a machine to comprehend the history to realize who the question is talking about and what has been excluded, and then understand the image to answer the question.

Conversational Modeling and Chatbots. Visual Dialog is the visual analogue of text-based dialog and conversation modeling. While some of the earliest developed chatbots were rule-based, end-to-end learning based approaches are now being actively explored. A recent large-scale conversation dataset is the Ubuntu Dialogue Corpus, which contains about 500K dialogs extracted from the Ubuntu channel on Internet Relay Chat (IRC). Liu et al. perform a study of problems in existing evaluation protocols for free-form dialog. One important difference between free-form textual dialog and VisDial is that in VisDial, the two participants are not symmetric – one person (the ‘questioner’) asks questions about an image that they do not see; the other person (the ‘answerer’) sees the image and only answers the questions (in otherwise unconstrained text, but no counter-questions allowed). This role assignment gives a sense of purpose to the interaction (why are we talking? To help the questioner build a mental model of the image), and allows objective evaluation of individual responses.

The Visual Dialog Dataset (VisDial)

We now describe our VisDial dataset. We begin by describing the chat interface and data-collection process on AMT, analyze the dataset, then discuss the evaluation protocol.

Consistent with previous data collection efforts, we collect visual dialog data on images from the Common Objects in Context (COCO) dataset, which contains multiple objects in everyday scenes. The visual complexity of these images allows for engaging and diverse conversations.

Live Chat Interface. Good data for this task should include dialogs that (1) have temporal continuity, (2) are grounded in the image, and (3) mimic natural ‘conversational’ exchanges. To elicit such responses, we paired 2 workers on AMT to chat with each other in real-time (Fig. 3). Each worker was assigned a specific role. One worker (the ‘questioner’) sees only a single line of text describing an image (caption from COCO); the image remains hidden to the questioner. Their task is to ask questions about this hidden image to ‘imagine the scene better’. The second worker (the ‘answerer’) sees the image and caption. Their task is to answer questions asked by their chat partner. Unlike VQA, answers are not restricted to be short or concise; instead, workers are encouraged to reply as naturally and ‘conversationally’ as possible. Fig. 3(c) shows an example dialog.

This process is an unconstrained ‘live’ chat, with the only exception that the questioner must wait to receive an answer before posting the next question. The workers are allowed to end the conversation after 20 messages are exchanged (10 pairs of questions and answers). Further details about our final interface can be found in the supplement.

We also piloted a different setup where the questioner saw a highly blurred version of the image, instead of the caption. The conversations seeded with blurred images resulted in questions that were essentially ‘blob recognition’ – ‘What is the pink patch at the bottom right?’. For our full-scale data-collection, we decided to seed with just the captions since it resulted in more ‘natural’ questions and more closely modeled the real-world applications discussed in Section 1 where no visual signal is available to the human.

Building a 2-person chat on AMT. Despite the popularity of AMT as a data collection platform in computer vision, our setup had to design for and overcome some unique challenges – the key issue being that AMT is simply not designed for multi-user Human Intelligence Tasks (HITs). Hosting a live two-person chat on AMT meant that none of the Amazon tools could be used and we developed our own backend messaging and data-storage infrastructure based on Redis messaging queues and Node.js. To support data quality, we ensured that a worker could not chat with themselves (using say, two different browser tabs) by maintaining a pool of paired worker IDs. To minimize wait time for one worker while the second was being searched for, we ensured that there was always a significant pool of available HITs. If one of the workers abandoned a HIT (or was disconnected) midway, automatic conditions in the code kicked in asking the remaining worker to either continue asking questions or provide facts (captions) about the image (depending on their role) till 10 messages were sent by them. Workers who completed the task in this way were fully compensated, but our backend discarded this data and automatically launched a new HIT on this image so a real two-person conversation could be recorded. Our entire data-collection infrastructure (front-end UI, chat interface, backend storage and messaging system, error handling protocols) is publicly available at https://github.com/batra-mlp-lab/visdial-amt-chat.

VisDial Dataset Analysis

We now analyze the v0.9 subset of our VisDial dataset – it contains 1 dialog (10 QA pairs) on $\sim$ 123k images from COCO-train/val, a total of 1,232,870 QA pairs.

Visual Priming Bias. One key difference between VisDial and previous image question-answering datasets (VQA, Visual 7W, Baidu mQA) is the lack of a ‘visual priming bias’ in VisDial. Specifically, in all previous datasets, subjects saw an image while asking questions about it. As analyzed in prior work, this leads to a particular bias in the questions – people only ask ‘Is there a clocktower in the picture?’ on pictures actually containing clock towers. This allows language-only models to perform remarkably well on VQA and results in an inflated sense of progress. As one particularly perverse example – for questions in the VQA dataset starting with ‘Do you see a …’, blindly answering ‘yes’ without reading the rest of the question or looking at the associated image results in an average VQA accuracy of $87\%$! In VisDial, questioners do not see the image. As a result, this bias is reduced.

Distributions. Fig. 4(a) shows the distribution of question lengths in VisDial – we see that most questions range from four to ten words. Fig. 5 shows ‘sunbursts’ visualizing the distribution of questions (based on the first four words) in VisDial vs. VQA. While there are a lot of similarities, some differences immediately jump out. There are more binary questions (questions starting with ‘Do’, ‘Did’, ‘Have’, ‘Has’, ‘Is’, ‘Are’, ‘Was’, ‘Were’, ‘Can’, ‘Could’) in VisDial as compared to VQA – the most frequent first question-word in VisDial is ‘is’ vs. ‘what’ in VQA. A detailed comparison of the statistics of VisDial vs. other datasets is available in Table 1 in the supplement.

Finally, there is a stylistic difference in the questions that is difficult to capture with the simple statistics above. In VQA, subjects saw the image and were asked to stump a smart robot. Thus, most queries involve specific details, often about the background (‘What program is being utilized in the background on the computer?’). In VisDial, questioners did not see the original image and were asking questions to build a mental model of the scene. Thus, the questions tend to be open-ended, and often follow a pattern:

Generally starting with the entities in the caption:

‘An elephant walking away from a pool in an exhibit’,

digging deeper into their parts or attributes:

‘Is it full grown?’, ‘Is it facing the camera?’,

asking about the scene category or the picture setting:

‘Is this indoors or outdoors?’, ‘Is this a zoo?’,

‘Are there people?’, ‘Is there shelter for elephant?’,

and asking follow-up questions about the new visual entities discovered from these explorations:

‘There’s a blue fence in background, like an enclosure’,

Analyzing VisDial Answers

Answer Lengths. Fig. 4(a) shows the distribution of answer lengths. Unlike previous datasets, answers in VisDial are longer and more descriptive – mean-length 2.9 words (VisDial) vs 1.1 (VQA), 2.0 (Visual 7W), 2.8 (Visual Madlibs).

Fig. 4(b) shows the cumulative coverage of all answers (y-axis) by the most frequent answers (x-axis). The difference between VisDial and VQA is stark – the top-1000 answers in VQA cover $\sim$ 83% of all answers, while in VisDial that figure is only $\sim$ 63%. There is a significant heavy tail in VisDial – most long strings are unique, and thus the coverage curve in Fig. 4(b) becomes a straight line with slope 1. In total, there are 337,527 unique answers in VisDial v0.9.
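The coverage statistic above can be reproduced in a few lines. The following is a minimal sketch, assuming answers are compared as lower-cased, whitespace-stripped strings (the exact normalization used for the reported numbers is not stated here); the function name `topk_coverage` and the toy answers are illustrative only.

```python
from collections import Counter

def topk_coverage(answers, k):
    """Fraction of all answer occurrences covered by the k most frequent answers."""
    # Assumption: answers are normalized by lower-casing and stripping whitespace.
    counts = Counter(a.strip().lower() for a in answers)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(answers)

# Toy data, invented for illustration (not drawn from VisDial).
answers = ["yes", "no", "yes", "2", "yes it is", "no", "yes", "i can't tell"]
coverage_top2 = topk_coverage(answers, 2)   # 'yes' (3) + 'no' (2) out of 8
num_unique = len(set(answers))
```

Once k exceeds the number of repeated answers, each extra unit of k adds exactly one singleton answer, which is why the tail of the coverage curve becomes a straight line.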

Answer Types. Since the answers in VisDial are longer strings, we can visualize their distribution based on the starting few words (Fig. 5(c)). An interesting category of answers emerges – ‘I think so’, ‘I can’t tell’, or ‘I can’t see’ – expressing doubt, uncertainty, or lack of information. This is a consequence of the questioner not being able to see the image – they are asking contextually relevant questions, but not all questions may be answerable with certainty from that image. We believe this is rich data for building more human-like AI that refuses to answer questions it doesn’t have enough information to answer. Prior work on question relevance in VQA is a related, but complementary effort.

Binary Questions vs Binary Answers. In VQA, binary questions are simply those with ‘yes’, ‘no’, ‘maybe’ as answers. In VisDial, we must distinguish between binary questions and binary answers. Binary questions are those starting with ‘Do’, ‘Did’, ‘Have’, ‘Has’, ‘Is’, ‘Are’, ‘Was’, ‘Were’, ‘Can’, ‘Could’. Answers to such questions can (1) contain only ‘yes’ or ‘no’, (2) begin with ‘yes’ or ‘no’ and contain additional information or clarification, (3) involve ambiguity (‘It’s hard to see’, ‘Maybe’), or (4) answer the question without explicitly saying ‘yes’ or ‘no’ (Q: ‘Is there any type of design or pattern on the cloth?’, A: ‘There are circles and lines on the cloth’). We call answers that contain ‘yes’ or ‘no’ binary answers – 149,367 and 76,346 answers in subsets (1) and (2) from above respectively. Binary answers in VQA are biased towards ‘yes’ – 61.40% of yes/no answers are ‘yes’. In VisDial, the trend is reversed. Only 46.96% are ‘yes’ for all yes/no responses. This is understandable since workers did not see the image, and were more likely to end up with negative responses.
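The distinction between binary questions and binary answers can be made concrete with a small classifier. This is a sketch under assumptions: the tokenization (lower-casing, keeping only letters and apostrophes) and the labels `'only'` and `'with_detail'` for subsets (1) and (2) are choices made here for illustration, not taken from the dataset tooling.

```python
import re

# First words that mark a binary question, as listed in the text.
BINARY_STARTS = {"do", "did", "have", "has", "is", "are", "was", "were", "can", "could"}

def is_binary_question(question):
    """True if the question starts with one of the listed words."""
    words = question.strip().lower().split()
    return bool(words) and words[0] in BINARY_STARTS

def binary_answer_type(answer):
    """Return 'only' for subset (1), 'with_detail' for subset (2), else None."""
    tokens = re.findall(r"[a-z']+", answer.lower())
    if not tokens or tokens[0] not in ("yes", "no"):
        return None  # ambiguous answers (3) and implicit answers (4)
    return "only" if len(tokens) == 1 else "with_detail"

def yes_fraction(answers):
    """Share of binary answers (subsets 1 and 2) that start with 'yes'."""
    binary = [a for a in answers if binary_answer_type(a) is not None]
    return sum(a.lower().startswith("yes") for a in binary) / len(binary)
```

For example, `binary_answer_type("No, it is red")` falls in subset (2), while ‘There are circles and lines on the cloth’ is not counted as a binary answer even though it answers a binary question.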

Analyzing VisDial Dialog

In Section 4.1, we discussed a typical flow of dialog in VisDial. We analyze two quantitative statistics here.

Coreference in dialog. Since language in VisDial is the result of a sequential conversation, it naturally contains pronouns – ‘he’, ‘she’, ‘his’, ‘her’, ‘it’, ‘their’, ‘they’, ‘this’, ‘that’, ‘those’, etc. In total, 38% of questions, 19% of answers, and nearly all (98%) dialogs contain at least one pronoun, thus confirming that a machine will need to overcome coreference ambiguities to be successful on this task. We find that pronoun usage is low in the first round (as expected) and then picks up in frequency. A fine-grained per-round analysis is available in the supplement.
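A surface-level pronoun count of this kind can be sketched as follows. The pronoun set contains only the examples listed above (the text ends the list with ‘etc.’, so the real list is presumably longer), and plain token matching is an assumption; it cannot tell a pronoun ‘that’ from a conjunction ‘that’.

```python
import re

# Only the pronouns named in the text; the full list used is not given.
PRONOUNS = {"he", "she", "his", "her", "it", "their", "they", "this", "that", "those"}

def has_pronoun(text):
    """True if any token of the text is in the pronoun set."""
    return any(tok in PRONOUNS for tok in re.findall(r"[a-z]+", text.lower()))

def pronoun_stats(dialogs):
    """dialogs: list of dialogs, each a list of (question, answer) pairs."""
    questions = [q for d in dialogs for q, _ in d]
    answers = [a for d in dialogs for _, a in d]
    return {
        "questions": sum(map(has_pronoun, questions)) / len(questions),
        "answers": sum(map(has_pronoun, answers)) / len(answers),
        "dialogs": sum(any(has_pronoun(q) or has_pronoun(a) for q, a in d)
                       for d in dialogs) / len(dialogs),
    }

# Toy dialogs, invented for illustration.
toy_dialogs = [
    [("Is there a dog?", "Yes"), ("What color is it?", "It is brown")],
    [("How many people?", "Two"), ("Is the sky blue?", "Yes")],
]
stats = pronoun_stats(toy_dialogs)
```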

Temporal Continuity in Dialog Topics. It is natural for conversational dialog data to have continuity in the ‘topics’ being discussed. We have already discussed qualitative differences in VisDial questions vs. VQA. In order to quantify the differences, we performed a human study where we manually annotated question ‘topics’ for $40$ images (a total of $400$ questions), chosen randomly from the val set. The topic annotations were based on human judgement with a consensus of 4 annotators, with topics such as: asking about a particular object (‘What is the man doing?’), scene (‘Is it outdoors or indoors?’), weather (‘Is the weather sunny?’), the image (‘Is it a color image?’), and exploration (‘Is there anything else?’). We performed similar topic annotation for questions from VQA for the same set of $40$ images, and compared topic continuity in questions. Across $10$ rounds, VisDial questions have $4.55\pm 0.17$ topics on average, confirming that these are not independent questions. Recall that VisDial has $10$ questions per image as opposed to $3$ for VQA. Therefore, for a fair comparison, we compute the average number of topics in VisDial over all subsets of $3$ successive questions. For $500$ bootstrap samples of batch size $40$, VisDial has $2.14\pm 0.05$ topics while VQA has $2.53\pm 0.09$. The lower mean suggests there is more continuity in VisDial because questions do not change topics as often.
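The windowed topic count and the bootstrap summary can be sketched as below. This is one reading of the procedure, not a confirmed reproduction: it assumes one topic label per question, that ‘subsets of 3 successive questions’ means sliding windows, and that each bootstrap sample draws 40 per-image values with replacement. Function names and the toy labels are hypothetical.

```python
import random
import statistics

def mean_topics_per_window(topics, window=3):
    """Average number of distinct topics over all runs of `window` successive questions."""
    counts = [len(set(topics[i:i + window]))
              for i in range(len(topics) - window + 1)]
    return sum(counts) / len(counts)

def bootstrap_mean_std(per_image_values, n_samples=500, batch_size=40, seed=0):
    """Mean and standard deviation of the sample mean over bootstrap resamples."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(per_image_values, k=batch_size))
             for _ in range(n_samples)]
    return statistics.fmean(means), statistics.stdev(means)

# Toy topic labels for one 10-round dialog, invented for illustration.
topics = ["object", "object", "scene", "weather", "weather",
          "weather", "object", "image", "exploration", "exploration"]
per_dialog = mean_topics_per_window(topics)   # 18 distinct-topic counts / 8 windows
```

A dialog that stays on one topic for several rounds pulls the windowed average towards 1, whereas three unrelated questions always give 3.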

VisDial Evaluation Protocol

One fundamental challenge in dialog systems is evaluation. Similar to the state of affairs in captioning and machine translation, it is an open problem to automatically evaluate the quality of free-form answers. Existing metrics such as BLEU, METEOR, ROUGE are known to correlate poorly with human judgement in evaluating dialog responses.

Instead of evaluating on a downstream task or holistically evaluating the entire conversation (as in goal-free chit-chat), we evaluate individual responses at each round ( $t=1,2,\ldots,10$ ) in a retrieval or multiple-choice setup.

Specifically, at test time, a VisDial system is given an image $I$ , the ‘ground-truth’ dialog history (including the image caption) $C,(Q_{1},A_{1}),\ldots,(Q_{t-1},A_{t-1})$ , the question $Q_{t}$ , and a list of $N=100$ candidate answers, and asked to return a sorting of the candidate answers. The model is evaluated on retrieval metrics – (1) rank of human response (lower is better), (2) recall@ $k$ , i.e. existence of the human response in top- $k$ ranked responses, and (3) mean reciprocal rank (MRR) of the human response (higher is better).
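The three retrieval metrics can be computed from the rank of the human response at each round. The sketch below assumes higher scores mean better candidates, resolves ties optimistically, and uses $k\in\{1,5,10\}$ as example cut-offs; none of these details are fixed by the text above.

```python
def rank_of_truth(scores, gt_index):
    """1-based rank of the human response when candidates are sorted by descending score."""
    gt = scores[gt_index]
    # Assumption: ties are resolved in favor of the ground truth.
    return 1 + sum(1 for s in scores if s > gt)

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Mean rank (lower is better), MRR and recall@k (higher is better)."""
    n = len(ranks)
    metrics = {
        "mean_rank": sum(ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"recall@{k}"] = sum(r <= k for r in ranks) / n
    return metrics

# Toy ranks for four (image, round) test items, invented for illustration.
metrics = retrieval_metrics([1, 2, 4, 10])
```

Note that MRR rewards placing the human response near the top far more than mean rank does: moving it from rank 2 to rank 1 adds 0.5 to the reciprocal rank, while moving it from rank 100 to rank 50 adds only 0.01.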

The evaluation protocol is compatible with both discriminative models (that simply score the input candidates, e.g. via a softmax over the options, and cannot generate new answers), and generative models (that generate an answer string, e.g. via Recurrent Neural Networks) by ranking the candidates by the model’s log-likelihood scores.

Candidate Answers. We generate a candidate set of correct and incorrect answers from four sets:

Correct: The ground-truth human response to the question.

Plausible: Answers to 50 most similar questions. Similar questions are those that start with similar tri-grams and mention similar semantic concepts in the rest of the question. To capture this, all questions are embedded into a vector space by concatenating the GloVe embeddings of the first three words with the averaged GloVe embeddings of the remaining words in the questions. Euclidean distances are used to compute neighbors. Since these neighboring questions were asked on different images, their answers serve as ‘hard negatives’.

Popular: The 30 most popular answers from the dataset – e.g. ‘yes’, ‘no’, ‘2’, ‘1’, ‘white’, ‘3’, ‘grey’, ‘gray’, ‘4’, ‘yes it is’. The inclusion of popular answers forces the machine to pick between likely a priori responses and plausible responses for the question, thus increasing the task difficulty.

Random: The remaining are answers to random questions in the dataset. To generate 100 candidates, we first find the union of the correct, plausible, and popular answers, and include random answers until a unique set of 100 is found.
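The question embedding and the assembly of the candidate list can be sketched as follows. Several details are assumptions made for illustration: out-of-vocabulary words and questions shorter than three words are padded with zero vectors, `glove` is a plain dictionary standing in for real GloVe vectors, and `random_pool` is assumed to be an already shuffled list of answers to random questions.

```python
def embed_question(question, glove, dim):
    """Concatenate embeddings of the first three words with the mean of the rest."""
    zero = [0.0] * dim
    words = question.lower().split()
    first = [glove.get(w, zero) for w in words[:3]]
    first += [zero] * (3 - len(first))          # assumption: zero-pad short questions
    rest = [glove.get(w, zero) for w in words[3:]]
    avg = [sum(col) / len(rest) for col in zip(*rest)] if rest else zero
    return [x for vec in first for x in vec] + avg

def build_candidates(correct, plausible, popular, random_pool, n=100):
    """Union of correct, plausible and popular answers, padded with random ones to n unique strings."""
    candidates, seen = [], set()
    for ans in [correct] + list(plausible) + list(popular) + list(random_pool):
        if ans not in seen:
            seen.add(ans)
            candidates.append(ans)
        if len(candidates) == n:
            break
    return candidates

# Toy 2-d "GloVe" table, invented for illustration.
glove = {"is": [1.0, 0.0], "it": [0.0, 1.0], "a": [1.0, 1.0],
         "big": [2.0, 0.0], "dog": [0.0, 2.0]}
vec = embed_question("is it a big dog", glove, dim=2)
```

Because the union of 1 correct, 50 plausible and 30 popular answers has at most 81 entries, at least 19 random answers are always needed to reach 100.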

Neural Visual Dialog Models

In this section, we develop a number of neural Visual Dialog answerer models. Recall that the model is given as input – an image $I$ , the ‘ground-truth’ dialog history (including the image caption) $H=(\underbrace{\vphantom{(Q_{1},A_{1})}C}_{H_{0}},\underbrace{(Q_{1},A_{1})}_{H_{1}},\ldots,\underbrace{(Q_{t-1},A_{t-1})}_{H_{t-1}})$ , the question $Q_{t}$ , and a list of 100 candidate answers $\mathcal{A}_{t}=\{A^{(1)}_{t},\ldots,A^{(100)}_{t}\}$ – and asked to return a sorting of $\mathcal{A}_{t}$ .

At a high level, all our models follow the encoder-decoder framework, i.e. factorize into two parts – (1) an encoder that converts the input $(I,H,Q_{t})$ into a vector space, and (2) a decoder that converts the embedded vector into an output. We describe choices for each component next and present experiments with all encoder-decoder combinations.

Generative (LSTM) decoder: where the encoded vector is set as the initial state of the Long Short-Term Memory (LSTM) RNN language model. During training, we maximize the log-likelihood of the ground truth answer sequence given its corresponding encoded representation (trained end-to-end). To evaluate, we use the model’s log-likelihood scores and rank candidate answers.
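Ranking candidates with a generative decoder reduces to scoring each candidate string under the language model. The sketch below replaces the LSTM with a hypothetical `token_logprob(token, prefix)` callable, here a toy unigram table, so only the ranking step is illustrated; summing token log-probabilities without length normalization is an assumption.

```python
import math

def sequence_log_likelihood(tokens, token_logprob):
    """Sum of per-token log-probabilities, each conditioned on the preceding tokens."""
    return sum(token_logprob(tok, tokens[:i]) for i, tok in enumerate(tokens))

def rank_candidates(candidates, token_logprob):
    """Candidates sorted from most to least likely under the model."""
    scored = [(sequence_log_likelihood(c.split(), token_logprob), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)]

# Toy stand-in for the decoder: a unigram model that ignores the prefix.
UNIGRAM = {"yes": 0.5, "no": 0.3, "it": 0.1, "is": 0.1}

def toy_logprob(token, prefix):
    return math.log(UNIGRAM.get(token, 1e-6))

ranking = rank_candidates(["no", "yes it is", "yes"], toy_logprob)
```

With an unnormalized sum, longer candidates accumulate more negative terms and tend to rank lower, which is one reason a length-normalized score is sometimes preferred.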

Note that this decoder does not need to score options during training. As a result, such models do not exploit the biases in option creation and typically underperform models that do, but it is debatable whether exploiting such biases is really indicative of progress. Moreover, generative decoders are more practical in that they can actually be deployed in realistic applications.

Discriminative (softmax) decoder: computes dot product similarity between input encoding and an LSTM encoding of each of the answer options. These dot products are fed into a softmax to compute the posterior probability over options. During training, we maximize the log-likelihood of the correct option. During evaluation, options are simply ranked based on their posterior probabilities.
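The scoring step of the discriminative decoder is a dot product followed by a softmax. A minimal sketch with plain Python lists, assuming the input encoding and the option encodings have already been produced by the encoder and the answer LSTM (both omitted here):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def discriminative_decoder(encoding, option_encodings):
    """Posterior over answer options from dot-product similarities."""
    return softmax([dot(encoding, o) for o in option_encodings])

def negative_log_likelihood(posterior, gt_index):
    """Training loss: minimizing this maximizes the log-likelihood of the correct option."""
    return -math.log(posterior[gt_index])

# Toy encodings, invented for illustration.
posterior = discriminative_decoder([1.0, 0.0], [[2.0, 0.0], [0.0, 5.0], [1.0, 1.0]])
ranking = sorted(range(len(posterior)), key=lambda i: posterior[i], reverse=True)
```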

Late Fusion (LF) Encoder: In this encoder, we treat $H$ as a long string with the entire history $(H_{0},\ldots,H_{t-1})$ concatenated. $Q_{t}$ and $H$ are separately encoded with 2 different LSTMs, and individual representations of participating inputs $(I,H,Q_{t})$ are concatenated and linearly transformed to a desired size of joint representation.
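The fusion step itself is a concatenation followed by a linear map. In the sketch below the image feature and the two LSTM outputs are given as plain vectors, and `weight` and `bias` stand for the learned parameters of the linear layer; all dimensions and values are toy choices.

```python
def late_fusion(image_feat, history_enc, question_enc, weight, bias):
    """Concatenate the three representations and apply a linear transform."""
    joint = image_feat + history_enc + question_enc   # list concatenation
    return [sum(w * x for w, x in zip(row, joint)) + b
            for row, b in zip(weight, bias)]

# Toy inputs: 2-d image feature, 1-d history encoding, 2-d question encoding.
weight = [[1.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 1.0, 1.0, 1.0]]
bias = [0.5, -1.0]
fused = late_fusion([1.0, 2.0], [3.0], [4.0, 5.0], weight, bias)
```

The output size is set by the number of rows in `weight`, independent of the sizes of the three inputs.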

Hierarchical Recurrent Encoder (HRE): In this encoder, we capture the intuition that there is a hierarchical nature to our problem – each question $Q_{t}$ is a sequence of words that need to be embedded, and the dialog as a whole is a sequence of question-answer pairs $(Q_{t},A_{t})$ . Thus, similar to prior work and as shown in Fig. 6, we propose an HRE model that contains a dialog-RNN sitting on top of a recurrent block ( $R_{t}$ ). The recurrent block $R_{t}$ embeds the question and image jointly via an LSTM (early fusion), embeds each round of the history $H_{t}$ , and passes a concatenation of these to the dialog-RNN above it. The dialog-RNN produces both an encoding for this round ( $E_{t}$ in Fig. 6) and a dialog context to pass onto the next round. We also add an attention-over-history (‘Attention’ in Fig. 6) mechanism allowing the recurrent block $R_{t}$ to choose and attend to the round of the history relevant to the current question. This attention mechanism consists of a softmax over previous rounds ( $0,1,\ldots,t-1$ ) computed from the history and question+image encoding.

Memory Network (MN) Encoder: We develop a MN encoder that maintains each previous question and answer as a ‘fact’ in its memory bank and learns to refer to the stored facts and image to answer the question. Specifically, we encode $Q_{t}$ with an LSTM to get a $512$ -d vector, and encode each previous round of history $(H_{0},\ldots,H_{t-1})$ with another LSTM to get a $t\times 512$ matrix. We compute the inner product of the question vector with each history vector to get scores over previous rounds, which are fed to a softmax to get attention-over-history probabilities. A convex combination of history vectors using these attention probabilities gives us the ‘context vector’, which is passed through an fc-layer and added to the question vector to construct the MN encoding. In the language of Memory Networks, this is a ‘1-hop’ encoding.
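The 1-hop computation can be written out directly. This sketch covers only the attention-over-history path described above: the LSTM encoders are replaced by given vectors, the image is left out, 2-d toy vectors stand in for the 512-d ones, and `fc_weight` and `fc_bias` are placeholders for the learned fc-layer.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def memory_network_encode(question_vec, history_vecs, fc_weight, fc_bias):
    """1-hop encoding: attend over history, project the context, add the question."""
    attn = softmax([dot(question_vec, h) for h in history_vecs])
    dim = len(question_vec)
    # Convex combination of history vectors weighted by attention.
    context = [sum(a * h[j] for a, h in zip(attn, history_vecs)) for j in range(dim)]
    projected = [dot(row, context) + b for row, b in zip(fc_weight, fc_bias)]
    return [p + q for p, q in zip(projected, question_vec)], attn

# Toy example: two history rounds, identity fc-layer.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoding, attn = memory_network_encode([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                                       identity, [0.0, 0.0])
```

In the toy example the first history round aligns with the question and therefore receives the larger attention weight.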

We use a ‘[encoder]-[input]-[decoder]’ convention to refer to model-input combinations. For example, ‘LF-QI-D’ has a Late Fusion encoder with question+image inputs (no history), and a discriminative decoder. Implementation details about the models can be found in the supplement.

Experiments

Splits. VisDial v0.9 contains 83k dialogs on COCO-train and 40k on COCO-val images. We split the 83k into 80k for training, 3k for validation, and use the 40k as test.

Data preprocessing, hyperparameters and training details are included in the supplement.

Baselines. We compare to a number of baselines: Answer Prior: Answer options to a test question are encoded with an LSTM and scored by a linear classifier. This captures ranking by frequency of answers in our training set without resorting to exact string matching. NN-Q: Given a test question, we find $k$ nearest neighbor questions (in GloVe space) from train, and score answer options by their mean-similarity with these $k$ answers. NN-QI: First, we find $K$ nearest neighbor questions for a test question. Then, we find a subset of size $k$ based on image feature similarity. Finally, we rank options by their mean-similarity to answers to these $k$ questions. We use $k=20,K=100$ .
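The NN-Q baseline can be sketched as a nearest-neighbor lookup followed by averaging similarities. Two details are assumptions, since the text does not specify them: neighbors are found by Euclidean distance between question vectors, and ‘similarity’ between an option and a neighbor's answer is taken to be cosine similarity of their vectors. All vectors below are toy values.

```python
import math

def cosine(u, v):
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def nn_q_scores(test_q, train_qs, train_as, options, k):
    """Score each option by its mean similarity to the answers of the k nearest train questions."""
    order = sorted(range(len(train_qs)), key=lambda i: math.dist(test_q, train_qs[i]))
    neighbors = order[:k]
    return [sum(cosine(opt, train_as[i]) for i in neighbors) / len(neighbors)
            for opt in options]

# Toy vectors, invented for illustration.
train_qs = [[0.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
train_as = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = nn_q_scores([0.0, 0.0], train_qs, train_as, [[1.0, 0.0], [0.0, 1.0]], k=2)
```

NN-QI would add one step between the lookup and the averaging: re-rank the $K$ retrieved neighbors by image feature similarity and keep the top $k$.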

Finally, we adapt several (near) state-of-art VQA models (SAN, HieCoAtt) to Visual Dialog. Since VQA is posed as classification, we ‘chop’ the final VQA-answer softmax from these models, feed these activations to our discriminative decoder (Section 5), and train end-to-end on VisDial. Note that our LF-QI-D model is similar to that in prior work. Altogether, these form fairly sophisticated baselines.

Results. Tab. 5 shows results for our models and baselines on VisDial v0.9 (evaluated on 40k from COCO-val).

A few key takeaways – 1) As expected, all learning based models significantly outperform non-learning baselines. 2) All discriminative models significantly outperform generative models, which as we discussed is expected since discriminative models can tune to the biases in the answer options. 3) Our best generative and discriminative models are MN-QIH-G with 0.526 MRR, and MN-QIH-D with 0.597 MRR. 4) We observe that naively incorporating history doesn’t help much (LF-Q vs. LF-QH and LF-QI vs. LF-QIH) or can even hurt a little (LF-QI-G vs. LF-QIH-G). However, models that better encode history (MN/HRE) perform better than corresponding LF models with/without history (e.g. LF-Q-D vs. MN-QH-D). 5) Models looking at $I$ ({LF,MN,HRE}-QIH) outperform corresponding blind models (without $I$ ).

Human Studies. We conduct studies on AMT to quantitatively evaluate human performance on this task for all combinations of {with image, without image} $\times$ {with history, without history}. We find that without image, humans perform better when they have access to dialog history. As expected, this gap narrows down when they have access to the image. Complete details can be found in supplement.

Conclusions

To summarize, we introduce a new AI task – Visual Dialog, where an AI agent must hold a dialog with a human about visual content. We develop a novel two-person chat data-collection protocol to curate a large-scale dataset (VisDial), propose a retrieval-based evaluation protocol, and develop a family of encoder-decoder models for Visual Dialog. We quantify human performance on this task via human studies. Our results indicate that there is significant scope for improvement, and we believe this task can serve as a testbed for measuring progress towards visual intelligence.

Acknowledgements

We thank Harsh Agrawal, Jiasen Lu for help with AMT data collection; Xiao Lin, Latha Pemula for model discussions; Marco Baroni, Antoine Bordes, Mike Lewis, Marc’Aurelio Ranzato for helpful discussions. We are grateful to the developers of Torch for building an excellent framework. This work was funded in part by NSF CAREER awards to DB and DP, ONR YIP awards to DP and DB, ONR Grant N00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen Distinguished Investigator award to DP from the Paul G. Allen Family Foundation, ICTAS Junior Faculty awards to DB and DP, Google Faculty Research Awards to DP and DB, Amazon Academic Research Awards to DP and DB, AWS in Education Research grant to DB, and NVIDIA GPU donations to DB. SK was supported by ONR Grant N00014-12-1-0903. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

Appendix Overview

This supplementary document is organized as follows:

Sec. A studies how and why VisDial is more than just a collection of independent Q&As.

Sec. B shows qualitative examples from our dataset.

Sec. C presents detailed human studies along with comparisons to machine accuracy. The interface for human studies is demonstrated in a video (https://goo.gl/yjlHxY).

Sec. D shows snapshots of our two-person chat data-collection interface on Amazon Mechanical Turk. The interface is also demonstrated in the video.

Sec. E presents further analysis of VisDial, such as question types, question and answer lengths per question type. A video with an interactive sunburst visualization of the dataset is included.

Sec. F presents performance of our models on VisDial v0.5 test.

Sec. G presents implementation-level training details including data preprocessing, and model architectures.

Putting it all together, we compile a video demonstrating our visual chatbot that answers a sequence of questions from a user about an image. This demo uses one of our best generative models from the main paper, MN-QIH-G, and uses sampling (without any beam-search) for inference in the LSTM decoder. Note that these videos demonstrate an ‘unscripted’ dialog – in the sense that the particular QA sequence is not present in VisDial and the model is not provided with any list of answer options.

Appendix A In what ways are dialogs in VisDial more than just 10 visual Q&As?

In this section, we lay out an exhaustive list of differences between VisDial and image question-answering datasets, with the VQA dataset serving as the representative.

In essence, we characterize what makes an instance in VisDial more than a collection of 10 independent question-answer pairs about an image – what makes it a dialog.

To keep this section self-contained and the list exhaustive, some parts repeat content from the main document.

Fig. 7(a) shows the distribution of answer lengths in VisDial, and Tab. 2 compares statistics of VisDial with existing image question answering datasets. Unlike previous datasets, answers in VisDial are longer, conversational, and more descriptive – mean-length 2.9 words (VisDial) vs 1.1 (VQA), 2.0 (Visual 7W), 2.8 (Visual Madlibs). Moreover, $37.1\%$ of answers in VisDial are longer than 2 words while the VQA dataset has only $3.8\%$ answers longer than 2 words.

Fig. 7(b) shows the cumulative coverage of all answers (y-axis) by the most frequent answers (x-axis). The difference between VisDial and VQA is stark – the top-1000 answers in VQA cover $\sim$ 83% of all answers, while in VisDial that figure is only $\sim$ 63%. There is a significant heavy tail of answers in VisDial – most long strings are unique, and thus the coverage curve in Fig. 7(b) becomes a straight line with slope 1. In total, there are 337,527 unique answers in VisDial (out of the 1,232,870 answers currently in the dataset).
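A coverage curve of this kind can be computed directly from the list of answer strings. The sketch below is illustrative only; the function name and the choice to treat answers as exact strings are assumptions.

```python
from collections import Counter

def coverage_curve(answers):
    """Cumulative fraction of all answers covered by the n most frequent
    answer strings, for n = 1 .. number of unique answers."""
    counts = Counter(answers)
    total = len(answers)
    covered, curve = 0, []
    for _, c in counts.most_common():
        covered += c
        curve.append(covered / total)
    return curve
```

Once the curve reaches answers that occur exactly once, each additional unique answer adds exactly one covered answer, which is why the tail of the plot is a straight line.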

A.2 VisDial has co-references in dialogs

People conversing with each other tend to use pronouns to refer to already mentioned entities. Since language in VisDial is the result of a sequential conversation, it naturally contains pronouns – ‘he’, ‘she’, ‘his’, ‘her’, ‘it’, ‘their’, ‘they’, ‘this’, ‘that’, ‘those’, etc. In total, $38\%$ of questions, $19\%$ of answers, and nearly all ( $98\%$ ) dialogs contain at least one pronoun, thus confirming that a machine will need to overcome coreference ambiguities to be successful on this task. As a comparison, only $9\%$ of questions and $0.25\%$ of answers in VQA contain at least one pronoun.

In Fig. 8, we see that pronoun usage is lower in the first round compared to other rounds, which is expected since there are fewer entities to refer to in the earlier rounds. The pronoun usage is also generally lower in answers than questions, which is also understandable since the answers are generally shorter than questions and thus less likely to contain pronouns. In general, the pronoun usage is fairly consistent across rounds (starting from round 2) for both questions and answers.
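Statistics of this kind can be approximated with a simple lexical match. The sketch below assumes the pronoun list quoted above (which ends in ‘etc.’ in the text, so it is incomplete) and a basic regular-expression tokenizer; both are assumptions, and a lexical match cannot tell a pronoun ‘that’ from other uses of the word.

```python
import re

# Partial list taken from the text; the full list used for the paper is not given.
PRONOUNS = {"he", "she", "his", "her", "it", "their", "they", "this", "that", "those"}

def has_pronoun(text):
    """True if any token of the lowercased text is in PRONOUNS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(t in PRONOUNS for t in tokens)

def pronoun_rate(texts):
    """Fraction of strings that contain at least one listed pronoun."""
    return sum(has_pronoun(t) for t in texts) / len(texts)
```

Applying `pronoun_rate` separately to the questions or answers of each round gives a per-round curve of the kind plotted in Fig. 8.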

A.3 VisDial has smoothness/continuity in ‘topics’

There is a stylistic difference in the questions asked in VisDial (compared to the questions in VQA) due to the nature of the task assigned to the subjects asking the questions. In VQA, subjects saw the image and were asked to “stump a smart robot”. Thus, most queries involve specific details, often about the background (Q: ‘What program is being utilized in the background on the computer?’). In VisDial, questioners did not see the original image and were asking questions to build a mental model of the scene. Thus, the questions tend to be open-ended, and often follow a pattern:

Generally starting with the entities in the caption:

‘An elephant walking away from a pool in an exhibit’,

digging deeper into their parts, attributes, or properties:

‘Is it full grown?’, ‘Is it facing the camera?’,

asking about the scene category or the picture setting:

‘Is this indoors or outdoors?’, ‘Is this a zoo?’,

‘Are there people?’, ‘Is there shelter for elephant?’,

and asking follow-up questions about the new visual entities discovered from these explorations:

‘There’s a blue fence in background, like an enclosure’,

Such a line of questioning does not exist in the VQA dataset, where the subjects were shown the questions already asked about an image, and explicitly instructed to ask about different entities .

Counting the Number of Topics.

In order to quantify these qualitative differences, we performed a human study where we manually annotated question ‘topics’ for $40$ images (a total of $400$ questions), chosen randomly from the val set. The topic annotations were based on human judgement with a consensus of 4 annotators, with topics such as: asking about a particular object (‘What is the man doing?’), the scene (‘Is it outdoors or indoors?’), the weather (‘Is the weather sunny?’), the image (‘Is it a color image?’), and exploration (‘Is there anything else?’). We performed similar topic annotation for questions from VQA for the same set of $40$ images, and compared topic continuity in questions.

Across $10$ rounds, VisDial questions have $4.55\pm 0.17$ topics on average, confirming that these are not $10$ independent questions. Recall that VisDial has $10$ questions per image as opposed to $3$ for VQA. Therefore, for a fair comparison, we compute the average number of topics in VisDial over all ‘sliding windows’ of $3$ successive questions. For $500$ bootstrap samples of batch size $40$ , VisDial has $2.14\pm 0.05$ topics while VQA has $2.53\pm 0.09$ . A lower mean number of topics suggests there is more continuity in VisDial because questions do not change topics as often.
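The sliding-window count can be sketched as follows, assuming each dialog is represented as a list of topic labels, one per question. The function name and label format are assumptions, and the bootstrap resampling over images is omitted.

```python
def mean_topics_per_window(topics, window=3):
    """Average number of distinct topics over all windows of `window`
    successive questions in one dialog."""
    n_windows = len(topics) - window + 1
    if n_windows < 1:
        raise ValueError("dialog is shorter than the window")
    counts = [len(set(topics[i:i + window])) for i in range(n_windows)]
    return sum(counts) / n_windows
```

A 10-round VisDial dialog yields 8 windows of size 3, while a VQA image with 3 questions yields a single window, so both datasets are compared on the same window length.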

Transition Probabilities over Topics.

We can take this analysis a step further by computing transition probabilities over topics as follows. For a given sequential dialog exchange, we count the number of topic transitions between consecutive QA pairs, normalized by the total number of possible transitions between rounds (9 for VisDial and 2 for VQA). We compute this ‘topic transition probability’ (how likely are two successive QA pairs to be about two different topics) for VisDial and VQA in two different settings – (1) in-order and (2) with a permuted sequence of QAs. Note that if VisDial were simply a collection of 10 independent QAs as opposed to a dialog, we would expect the topic transition probabilities to be similar for in-order and permuted variants. However, we find that for 1000 permutations of 40 topic-annotated image-dialogs, in-order-VisDial has an average topic transition probability of $0.61$ , while permuted-VisDial has $0.76\pm 0.02$ . In contrast, VQA has a topic transition probability of $0.80$ for in-order vs. $0.83\pm 0.02$ for permuted QAs.
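The in-order and permuted transition probabilities can be sketched for a single dialog as below. The topic labels, function names and the fixed random seed are assumptions for illustration; averaging over the 40 annotated dialogs is left out.

```python
import random

def transition_probability(topics):
    """Fraction of consecutive QA pairs whose topics differ
    (denominator is 9 for a 10-round dialog, 2 for 3 VQA questions)."""
    transitions = sum(1 for a, b in zip(topics, topics[1:]) if a != b)
    return transitions / (len(topics) - 1)

def permuted_transition_probability(topics, n_perm=1000, seed=0):
    """Mean transition probability over random permutations of the rounds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_perm):
        shuffled = list(topics)
        rng.shuffle(shuffled)
        total += transition_probability(shuffled)
    return total / n_perm
```

If the rounds were independent, the two functions would return similar values; a clearly lower in-order value indicates that neighboring rounds tend to stay on the same topic.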

There are two key observations: (1) In-order transition probability is lower for VisDial than VQA (i.e. topic transition is less likely in VisDial), and (2) Permuting the order of questions results in a larger increase for VisDial, around $0.15$ , compared to a mere $0.03$ in case of VQA (i.e. in-order-VQA and permuted-VQA behave significantly more similarly than in-order-VisDial and permuted-VisDial).

Both these observations establish that there is smoothness in the temporal order of topics in VisDial, which is indicative of the narrative structure of a dialog, rather than independent question-answers.

A.4 VisDial has the statistics of an NLP dialog dataset

In this analysis, our goal is to measure whether VisDial behaves like a dialog dataset.

In particular, we compare VisDial, VQA, and Cornell Movie-Dialogs Corpus . The Cornell Movie-Dialogs corpus is a text-only dataset extracted from pairwise interactions between characters from approximately $617$ movies, and is widely used as a standard dialog corpus in the natural language processing (NLP) and dialog communities.

One popular evaluation criterion used in the dialog-systems research community is the perplexity of language models trained on dialog datasets – the lower the perplexity of a model, the better it has learned the structure in the dialog dataset.

For the purpose of our analysis, we pick the popular sequence-to-sequence (Seq2Seq) language model and use the perplexity of this model trained on different datasets as a measure of temporal structure in a dataset.

As is standard in the dialog literature, we train the Seq2Seq model to predict the probability of utterance $U_{t}$ given the previous utterance $U_{t-1}$ , i.e. $\textbf{P}(U_{t}\mid U_{t-1})$ on the Cornell corpus. For VisDial and VQA, we train the Seq2Seq model to predict the probability of a question $Q_{t}$ given the previous question-answer pair, i.e. $\textbf{P}(Q_{t}\mid(Q_{t-1},A_{t-1}))$ .

For each dataset, we used its train and val splits for training and hyperparameter tuning respectively, and report results on test. At test time, we only use conversations of length $10$ from Cornell corpus for a fair comparison to VisDial (which has 10 rounds of QA).

For all three datasets, we created $100$ permuted versions of test, where either QA pairs or utterances are randomly shuffled to disturb their natural order. This allows us to compare datasets in their natural ordering w.r.t. permuted orderings. Our hypothesis is that since dialog datasets have linguistic structure in the sequence of QAs or utterances they contain, this structure will be significantly affected by permuting the sequence. In contrast, a collection of independent question-answers (as in VQA) will not be significantly affected by a permutation.

Tab. 3 compares the original, unshuffled test with the shuffled testsets on two metrics:

Perplexity:

We compute the standard metric of perplexity per token, i.e. the exponent of the normalized negative-log-probability of a sequence (where normalization is by the length of the sequence). Tab. 3 shows these perplexities for the original unshuffled test and permuted test sequences.
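Concretely, for a sequence of $N$ tokens with model probabilities $p_{1},\dots,p_{N}$ , perplexity per token is $\exp(-\frac{1}{N}\sum_{i}\log p_{i})$ . A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from the language model (the function name is an assumption):

```python
import math

def perplexity(token_log_probs):
    """Perplexity per token: exp of the length-normalized negative
    log-probability of a sequence (natural logarithms)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, i.e. it is as uncertain as a uniform choice among four tokens.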

First, we note that the absolute perplexity values are higher for the Cornell corpus than for the QA datasets. We hypothesize that this is due to the broad, unrestrictive dialog generation task in the Cornell corpus, which is more difficult than the comparatively restricted task of predicting questions about images.

Second, in all three datasets, the shuffled test has statistically significantly higher perplexity than the original test, which indicates that shuffling does indeed break the linguistic structure in the sequences.

Classification:

As our second metric to compare datasets in their natural vs. permuted order, we test whether we can reliably classify a given sequence as natural or permuted.

Our classifier is a simple threshold on perplexity of a sequence. Specifically, given a pair of sequences, we compute the perplexity of both from our Seq2Seq model, and predict that the one with higher perplexity is the sequence in permuted ordering, and the sequence with lower perplexity is the one in natural ordering. The accuracy of this simple classifier indicates how easy or difficult it is to tell the difference between natural and permuted sequences. A higher classification rate indicates existence of temporal continuity in the conversation, thus making the ordering important.
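The pairwise decision rule can be sketched as follows, assuming the perplexities of each natural sequence and of one permuted counterpart are already available. The function name and the convention that ties count as errors are assumptions.

```python
def pairwise_accuracy(natural_ppls, permuted_ppls):
    """Accuracy of predicting 'permuted' for the higher-perplexity member
    of each (natural, permuted) pair. Ties are counted as errors."""
    pairs = list(zip(natural_ppls, permuted_ppls))
    correct = sum(1 for nat, perm in pairs if perm > nat)
    return correct / len(pairs)
```

Because each pair contains exactly one natural and one permuted sequence, a rule that ignores ordering information would be right half of the time.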

Tab. 3 shows the classification accuracies achieved on all datasets. We can see that the classifier on VisDial achieves the highest accuracy ( $73.3\%$ ), followed by Cornell ( $61.0\%$ ). Note that this is a binary classification task with the prior probability of each class by design being equal, thus chance performance is $50\%$ . The classifiers on VisDial and Cornell both significantly outperform chance. On the other hand, the classifier on VQA is near chance ( $52.8\%$ ), indicating a lack of general temporal continuity.

To summarize this analysis, our experiments show that VisDial is significantly more dialog-like than VQA, and behaves more like a standard dialog dataset, the Cornell Movie-Dialogs corpus.

A.5 VisDial eliminates visual priming bias in VQA

One key difference between VisDial and previous image question answering datasets (VQA , Visual 7W , Baidu mQA ) is the lack of a ‘visual priming bias’ in VisDial. Specifically, in all previous datasets, subjects saw an image while asking questions about it. As described in , this leads to a particular bias in the questions – people only ask ‘Is there a clocktower in the picture?’ on pictures actually containing clock towers. This allows language-only models to perform remarkably well on VQA and results in an inflated sense of progress . As one particularly perverse example – for questions in the VQA dataset starting with ‘Do you see a …’, blindly answering ‘yes’ without reading the rest of the question or looking at the associated image results in an average VQA accuracy of $87\%$ ! In VisDial, questioners do not see the image. As a result, this bias is reduced.

This lack of visual priming bias (i.e. not being able to see the image while asking questions) and holding a dialog with another person while asking questions results in the following two unique features in VisDial.

Since the answers in VisDial are longer strings, we can visualize their distribution based on the starting few words (Fig. 9). An interesting category of answers emerges – ‘I think so’, ‘I can’t tell’, or ‘I can’t see’ – expressing doubt, uncertainty, or lack of information. This is a consequence of the questioner not being able to see the image – they are asking contextually relevant questions, but not all questions may be answerable with certainty from that image. We believe this is rich data for building more human-like AI that refuses to answer questions it doesn’t have enough information to answer. See for a related, but complementary effort on question relevance in VQA.

Binary Questions $\neq$ Binary Answers in VisDial.

In VQA, binary questions are simply those with ‘yes’, ‘no’, ‘maybe’ as answers . In VisDial, we must distinguish between binary questions and binary answers. Binary questions are those starting with ‘Do’, ‘Did’, ‘Have’, ‘Has’, ‘Is’, ‘Are’, ‘Was’, ‘Were’, ‘Can’, ‘Could’. Answers to such questions can (1) contain only ‘yes’ or ‘no’, (2) begin with ‘yes’, ‘no’, and contain additional information or clarification (Q: ‘Are there any animals in the image?’, A: ‘yes, 2 cats and a dog’), (3) involve ambiguity (‘It’s hard to see’, ‘Maybe’), or (4) answer the question without explicitly saying ‘yes’ or ‘no’ (Q: ‘Is there any type of design or pattern on the cloth?’, A: ‘There are circles and lines on the cloth’). We refer to answers that contain ‘yes’ or ‘no’ as binary answers – 149,367 and 76,346 answers in subsets (1) and (2) from above respectively. Binary answers in VQA are biased towards ‘yes’ – 61.40% of yes/no answers are ‘yes’. In VisDial, the trend is reversed. Only 46.96% are ‘yes’ for all yes/no responses. This is understandable since workers did not see the image, and were more likely to end up with negative responses.
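The distinction between binary questions and binary answers can be sketched with simple string rules. This is an illustration under stated assumptions: a question is matched on its exact first word (so ‘Does’ or ‘Isn’t’ would not match), and only answer subsets (1) and (2) are detected, since (3) and (4) require more than a prefix check. The names and category codes are hypothetical.

```python
import re

BINARY_STARTS = ("do", "did", "have", "has", "is", "are", "was", "were", "can", "could")

def is_binary_question(question):
    """True if the first word of the question is one of the listed starters."""
    words = question.lower().split()
    return bool(words) and words[0] in BINARY_STARTS

def answer_category(answer):
    """1: only 'yes' or 'no'; 2: begins with 'yes'/'no' plus more text;
    0: anything else (ambiguous or implicit answers)."""
    tokens = re.findall(r"[a-z0-9']+", answer.lower())
    if tokens and tokens[0] in ("yes", "no"):
        return 1 if len(tokens) == 1 else 2
    return 0
```

Under this scheme the binary answers of the text are those with category 1 or 2, and the share of ‘yes’ among them can be counted from the first token.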

Appendix B Qualitative Examples from VisDial

Fig. 10 shows random samples of dialogs from the VisDial dataset.

Appendix C Human-Machine Comparison

We conducted studies on AMT to quantitatively evaluate human performance on this task for all combinations of {with image, without image} $\times$ {with history, without history} on 100 random images at each of the 10 rounds. Specifically, in each setting, we show human subjects a jumbled list of 10 candidate answers for a question – top-9 predicted responses from our ‘LF-QIH-D’ model and the 1 ground truth answer – and ask them to rank the responses. Each task was done by 3 human subjects.

Results of this study are shown in the top-half of Tab. 4. We find that without access to the image, humans perform better when they have access to dialog history – compare the Human-QH row to Human-Q (R@1 of 30.31 vs. 25.10). As perhaps expected, this gap narrows down when humans have access to the image – compare Human-QIH to Human-QI (R@1 of 48.03 vs. 46.12).

Note that these numbers are not directly comparable to machine performance reported in the main paper because models are tasked with ranking 100 responses, while humans are asked to rank 10 candidates. This is because the task of ranking 100 candidate responses would be too cumbersome for humans.

To compute comparable human and machine performance, we evaluate our best discriminative (MN-QIH-D) and generative (HREA-QIH-G, MN-QIH-G) models on the same 10 options that were presented to humans. We use both HREA-QIH-G and MN-QIH-G since they have similar accuracies. Note that in this setting, both humans and machines have R $@10$ = $1.0$ , since there are only 10 options.

Tab. 4 bottom-half shows the results of this comparison. We can see that, as expected, humans with full information (i.e. Human-QIH) perform the best with a large gap in human and machine performance (compare R@5: Human-QIH $83.76\%$ vs. MN-QIH-D $69.39\%$ ). This gap is even larger when compared to generative models, which unlike the discriminative models are not actively trying to exploit the biases in the answer candidates (compare R@5: Human-QIH $83.76\%$ vs. HREA-QIH-G $61.61\%$ ).

Furthermore, we see that humans outperform the best machine even when not looking at the image, simply on the basis of the context provided by the history (compare R@5: Human-QH $70.53\%$ vs. MN-QIH-D $69.39\%$ ).

Perhaps as expected, with access to the image but not the history, humans are significantly better than the best machines (R@5: Human-QI $82.54\%$ vs. MN-QIH-D $69.39\%$ ). With access to history humans perform even better.

From in-house human studies and worker feedback on AMT, we find that dialog history plays the following roles for humans: (1) provides a context for the question and paints a picture of the scene, which helps eliminate certain answer choices (especially when the image is not available), (2) gives cues about the answerer’s response style, which helps identify the right answer among similar answer choices, and (3) disambiguates amongst likely interpretations of the image (i.e., when objects are small or occluded), again, helping identify the right answer among multiple plausible options.

Appendix D Interface

In this section, we show our interface to connect two Amazon Mechanical Turk workers live, which we used to collect our data.

Instructions. To ensure quality of data, we provide detailed instructions on our interface as shown in Fig. 11(a). Since the workers do not know their roles before starting the study, we provide instructions for both questioner and answerer roles.

After pairing: Immediately after pairing two workers, we assign them roles of a questioner and an answerer and display role-specific instructions as shown in Fig. 11(b). Observe that the questioner does not see the image while the answerer does have access to it. Both questioner and answerer see the caption for the image.

Appendix E Additional Analysis of VisDial

In this section, we present additional analyses characterizing our VisDial dataset.

Fig. 12 shows question lengths by type and round. Average length of question by type is consistent across rounds. Questions starting with ‘any’ (‘any people?’, ‘any other fruits?’, etc.) tend to be the shortest. Fig. 13 shows answer lengths by round and by the type of question they respond to. In contrast to questions, there is significant variance in answer lengths. Answers to binary questions (‘Any people?’, ‘Can you see the dog?’, etc.) tend to be short while answers to ‘how’ and ‘what’ questions tend to be more explanatory and long. Across question types, answers tend to be the longest in the middle of conversations.

E.2 Question Types

Fig. 14 shows round-wise coverage by question type. We see that as conversations progress, ‘is’, ‘what’ and ‘how’ questions reduce while ‘can’, ‘do’, ‘does’, ‘any’ questions occur more often. Questions starting with ‘Is’ are the most popular in the dataset.

Appendix F Performance on VisDial v0.5

Tab. 5 shows the results for our proposed models and baselines on VisDial v0.5. A few key takeaways – First, as expected, all learning-based models significantly outperform non-learning baselines. Second, all discriminative models significantly outperform generative models, which as we discussed is expected since discriminative models can tune to the biases in the answer options. This improvement comes with the significant limitation of not being able to actually generate responses, and we recommend the two decoders be viewed as separate use cases. Third, our best generative and discriminative models are MN-QIH-G with 0.44 MRR, and MN-QIH-D with 0.53 MRR, which outperform a suite of models and sophisticated baselines. Fourth, we observe that models with $H$ perform better than $Q$ -only models, highlighting the importance of history in VisDial. Fifth, models looking at $I$ outperform both the blind models ( $Q$ , $QH$ ) by at least $2\%$ on recall@ $1$ in both decoders. Finally, models that use both $H$ and $I$ have the best performance.

Dialog-level evaluation. Using R $@5$ to define round-level ‘success’, our best discriminative model MN-QIH-D gets $7.01$ rounds out of 10 correct, while generative MN-QIH-G gets $5.37$ . Further, the mean first-failure-round (under $R@5$ ) for MN-QIH-D is $3.23$ , and $2.39$ for MN-QIH-G. Fig. 16(a) and Fig. 16(b) show plots for all values of $k$ in $R@k$ .
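Both dialog-level quantities can be derived from the per-round rank of the ground-truth answer. The sketch below assumes 1-based ranks and, as a labeled assumption, assigns a dialog with no failed round a first-failure round of one past its last round; the text does not say how such dialogs are handled.

```python
def dialog_level_stats(gt_ranks_per_dialog, k=5):
    """gt_ranks_per_dialog: one list per dialog holding the 1-based rank of
    the ground-truth answer at each round.
    Returns (mean number of successful rounds, mean first-failure round)."""
    n = len(gt_ranks_per_dialog)
    total_success, total_first_fail = 0, 0
    for ranks in gt_ranks_per_dialog:
        success = [r <= k for r in ranks]  # round-level success under R@k
        total_success += sum(success)
        # Assumption: dialogs without any failure get len(ranks) + 1.
        first_fail = next((i + 1 for i, ok in enumerate(success) if not ok),
                          len(ranks) + 1)
        total_first_fail += first_fail
    return total_success / n, total_first_fail / n
```

Sweeping `k` from 1 to 100 produces curves of the kind shown in Fig. 16.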

Appendix G Experimental Details

In this section, we describe details about our models, data preprocessing, training procedure and hyperparameter selection.

We encode the image with a VGG-16 CNN, question and concatenated history with separate LSTMs and concatenate the three representations. This is followed by a fully-connected layer and tanh non-linearity to a $512$ -d vector, which is used to decode the response. Fig. 17(a) shows the model architecture for our LF encoder.
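The fusion step itself is a concatenation followed by one fully-connected layer with tanh. A minimal sketch in plain Python, assuming the three feature vectors have already been produced by the CNN and the two LSTMs; the function names and the list-of-rows weight layout are assumptions, and the toy dimensions in any usage are far smaller than the $512$ -d output described above.

```python
import math

def linear_tanh(x, weights, bias):
    """Fully-connected layer followed by tanh.
    weights holds one row of input weights per output unit."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def late_fusion(img_feat, q_feat, h_feat, weights, bias):
    """Concatenate image, question and history features, then project."""
    fused = list(img_feat) + list(q_feat) + list(h_feat)
    return linear_tanh(fused, weights, bias)
```

The output length equals the number of weight rows, so a weight matrix with 512 rows gives the 512-d vector passed to the decoder.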

Hierarchical Recurrent Encoder (HRE).

In this encoder, the image representation from VGG-16 CNN is early fused with the question. Specifically, the image representation is concatenated with every question word as it is fed to an LSTM. Each QA-pair in dialog history is independently encoded by another LSTM with shared weights. The image-question representation, computed for every round from $1$ through $t$ , is concatenated with history representation from the previous round and constitutes a sequence of question-history vectors. These vectors are fed as input to a dialog-level LSTM, whose output state at $t$ is used to decode the response to $Q_{t}$ . Fig. 17(b) shows the model architecture for our HRE.

Memory Network.

The image is encoded with a VGG-16 CNN and the question with an LSTM. We concatenate the representations and follow this with a fully-connected layer and tanh non-linearity to get a ‘query vector’. Each caption/QA-pair (or ‘fact’) in dialog history is encoded independently by an LSTM with shared weights. The query vector is then used to compute attention over the $t$ facts by inner product. The convex combination of attended history vectors is passed through a fully-connected layer and tanh non-linearity, and added back to the query vector. This combined representation is then passed through another fully-connected layer and tanh non-linearity and then used to decode the response. The model architecture is shown in Fig. 17(c). Fig. 18 shows some examples of attention over history facts from our MN encoder. We see that the model learns to attend to facts relevant to the question being asked. For example, when asked ‘What color are kites?’, the model attends to ‘A lot of people stand around flying kites in a park.’ For ‘Is anyone on bus?’, it attends to ‘A large yellow bus parked in some grass.’ Note that these are selected examples, and these attention weights are not always interpretable.
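The attention step over history facts can be sketched as below. The sketch assumes the inner-product scores are normalized with a softmax so that the weights form a convex combination; it covers only the attention and pooling, and the subsequent fully-connected layers and the residual addition to the query vector are omitted. Function names are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend_over_facts(query, facts):
    """Inner-product attention of a query vector over fact vectors.
    Returns (attention weights, convex combination of the facts)."""
    scores = [sum(q * f for q, f in zip(query, fact)) for fact in facts]
    weights = softmax(scores)
    attended = [sum(w * fact[d] for w, fact in zip(weights, facts))
                for d in range(len(query))]
    return weights, attended
```

The returned weights correspond to the per-fact attention values visualized in Fig. 18.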

G.2 Training

Recall that VisDial v0.9 contained 83k dialogs on COCO-train and 40k on COCO-val images. We split the 83k into 80k for training, 3k for validation, and use the 40k as test.

Preprocessing.

We spell-correct VisDial data using the Bing API . Following VQA, we lowercase all questions and answers, convert digits to words, and remove contractions, before tokenizing using the Python NLTK . We then construct a dictionary of words that appear at least five times in the train set, giving us a vocabulary of around $7.5$ k.
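The vocabulary construction can be sketched as follows. This is a simplified stand-in: a regular expression replaces the NLTK tokenizer, and spell-correction, digit-to-word conversion and contraction removal are not shown. The `<unk>` entry for out-of-vocabulary words and the function names are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric runs and single punctuation marks
    (a simple stand-in for a full tokenizer)."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def build_vocab(train_texts, min_count=5):
    """Map every token seen at least `min_count` times in train to an index.
    Index 0 is reserved for a hypothetical '<unk>' token."""
    counts = Counter(tok for text in train_texts for tok in tokenize(text))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(["<unk>"] + words)}
```

Tokens below the threshold are simply absent from the dictionary and would be mapped to index 0 at lookup time.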

Hyperparameters.

All our models are implemented in Torch . Model hyperparameters are chosen by early stopping on val based on the Mean Reciprocal Rank (MRR) metric. All LSTMs are 2-layered with $512$ -dim hidden states. We learn $300$ -dim embeddings for words and images. These word embeddings are shared across question, history, and decoder LSTMs. We use Adam with a learning rate of $10^{-3}$ for all models. Gradients at each iteration are clamped to a fixed range to avoid explosion. Our code, architectures, and trained models are available at https://visualdialog.org.