Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning

Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, Anton van den Hengel

Introduction

The combined interpretation of vision and language has enabled the development of a range of applications that have made interesting steps towards Artificial Intelligence, including Image Captioning, Visual Question Answering (VQA), and Referring Expressions. VQA, for example, requires an agent to answer a previously unseen question about a previously unseen image, and is recognised as being an AI-Complete problem. Visual Dialogue represents an extension to the VQA problem whereby an agent is required to engage in a dialogue about an image. This is significant because it demands that the agent is able to answer a series of questions, each of which may be predicated on the previous questions and answers in the dialogue. Visual Dialogue thus reflects one of the key challenges in AI and Robotics, which is to enable an agent capable of acting upon the world, that we might collaborate with through dialogue.

Due to the similarity between the VQA and Visual Dialog tasks, VQA methods have been directly applied to solve the Visual Dialog problem. The fact that the Visual Dialog challenge requires an ongoing conversation, however, demands more than just taking into consideration the state of the conversation thus far. Ideally, the agent should be an engaged participant in the conversation, cooperating towards a larger goal, rather than generating single word answers, even if they are easier to optimise. Figure 1 provides an example of the distinction between the type of responses a VQA agent might generate and the more involved responses that a human is likely to generate if they are engaged in the conversation. These more human-like responses are not only longer, they provide reasoning information that might be of use even though it is not specifically asked for.

Previous Visual Dialog systems follow a neural translation mechanism that is often used in VQA, predicting the response given the image and the dialog history using the maximum likelihood estimation (MLE) objective function. However, because this over-simplified training objective focuses only on measuring word-level correctness, the produced responses tend to be generic and repetitive. For example, a simple response of ‘yes’, ‘no’, or ‘I don’t know’ can safely answer a large number of questions and lead to a high MLE objective value. Generating more comprehensive answers, and a deeper engagement of the agent in the dialogue, requires a more engaged training process.

A good dialogue generation model should generate responses indistinguishable from those a human might produce. In this paper, we introduce an adversarial learning strategy, motivated by the previous success of adversarial learning in many computer vision and sequence generation problems. In particular, we frame the task as a reinforcement learning problem in which we jointly train two sub-modules: a sequence generative model that produces response sentences on the basis of the image content and the dialog history, and a discriminator that leverages the generator’s memories to distinguish between human-generated and machine-generated dialogues. The generator learns to generate responses that can fool the discriminator into believing that they are human generated, while the output of the discriminative model is used as a reward to the generative model, encouraging it to generate more human-like dialogue.

Although our proposed framework is inspired by generative adversarial networks (GANs), there are several technical contributions that lead to the final success on the visual dialog generation task. First, we propose a sequential co-attention generative model that aims to ensure that attention can be passed effectively across the image, question and dialog history. The co-attended multi-modal features are combined to generate a response. Secondly, and significantly, within the structure we propose the discriminator has access to the attention weights the generator used in generating its response. Note that the attention weights can be seen as a form of ‘reason’ for the generated response. For example, they indicate which region should be focused on and which dialog pairs are informative when generating the response. This structure is important as it allows the discriminator to assess the quality of the response, given the reason. It also allows the discriminator to assess the response in the context of the dialogue thus far. Finally, as with most sequence generation problems, the quality of the response can only be assessed over the whole sequence. Following previous work, we apply Monte Carlo (MC) search to calculate the intermediate rewards.

We evaluate our method on the VisDial dataset and show that it outperforms the baseline methods by a large margin. We also outperform several state-of-the-art methods. Specifically, our adversarially learned generative model outperforms our strong baseline MLE model by 1.87% on recall@5, improving over the previous best reported results by 2.14% on recall@5 and 2.50% on recall@10. Qualitative evaluation shows that our generative model generates more informative responses, and a human study shows that 49% of our responses pass the Turing Test. We additionally implement a model under the discriminative setting (where a candidate response list is given) and achieve state-of-the-art performance.

Related work

Visual dialog is the latest in a succession of vision-and-language problems that began with image captioning, and includes visual question answering. However, in contrast to these classical vision-and-language tasks, which involve at most a single natural language interaction, visual dialog requires the machine to hold a meaningful dialogue in natural language about visual content. Mostafazadeh et al. propose an Image Grounded Conversation (IGC) dataset and task that requires a model to generate natural-sounding conversations (both questions and responses) about a shared image. De Vries et al. propose a ‘GuessWhat’ game style dataset, where one person asks questions about an image to guess which object has been ‘selected’, and the second person answers the questions with ‘yes’/‘no’/NA. Das et al. propose the largest visual dialog dataset, VisDial, by pairing two subjects on Amazon Mechanical Turk to chat about an image. They further formulate the task as a ‘multi-round’ VQA task and evaluate individual responses at each round in a retrieval or multiple-choice setup. Recently, Das et al. propose to use RL to learn the policies of a ‘Questioner-Bot’ and an ‘Answerer-Bot’, based on the goal of selecting the right image that the two agents are talking about, from the VisDial dataset.

Concurrent with our work, Lu et al. propose a similar generative-discriminative model for Visual Dialog. However, there are two differences. First, their discriminative model must receive a list of candidate responses and learns to sort this list from the training dataset, which means the model can only be trained when such information is available. Second, their discriminator only considers the generated response and the provided list of candidate responses. Instead, we measure whether the generated response is valid given the attention weights, which reflect both the reasoning of the model and the history of the dialogue thus far. As we show in our experiments in Sec. 4, this procedure results in our generator producing more suitable responses.

Dialog generation in NLP

Text-only dialog generation has been studied for many years in the Natural Language Processing (NLP) literature, and has led to many applications. Recently, the popular ‘Xiaoice’ produced by Microsoft and the ‘Its Alive’ chatbot created by Facebook have attracted significant public attention. In NLP, dialog generation is typically viewed as a sequence-to-sequence (Seq2Seq) problem, or formulated as a statistical machine translation problem. Inspired by the success of the Seq2Seq model in machine translation, several works build end-to-end dialog generation models using an encoder-decoder model. Reinforcement learning (RL) has also been applied to train dialog systems. Li et al. simulate two virtual agents and hand-craft three rewards (informativity, coherence and ease of answering) to train the response generation model. Recently, some works have made an effort to integrate the Seq2Seq model and RL. For example, real users have been introduced by combining RL with neural generation.

Li et al. were the first to introduce GANs for dialogue generation as an alternative to human evaluation. They jointly train a generative (Seq2Seq) model to produce response sequences and a discriminator to distinguish between human- and machine-generated responses. Although we also introduce an adversarial learning framework to visual dialog generation in this work, one significant difference is that we need to consider the visual content in both the generative and discriminative components of the system, whereas the previous work only requires textual information. We thus design a sequential co-attention mechanism for the generator and an attention memory access mechanism for the discriminator so that we can jointly reason over the visual and textual information. Critically, the GAN we propose here is tightly integrated into the attention mechanism that generates human-interpretable reasons for each answer. This means that the discriminative model of the GAN has the task of assessing whether a candidate answer is generated by a human or not, given the provided reason. This is significant because it drives the generative model to produce high quality answers that are well supported by the associated reasoning. More details about our generator and discriminator can be found in Sections 3.1 and 3.2 respectively.

Adversarial learning

Generative adversarial networks have enjoyed great success in a wide range of applications in Computer Vision, especially in image generation tasks. The learning process is formulated as an adversarial game in which the generative model is trained to generate outputs to fool the discriminator, while the discriminator is trained not to be fooled. These two models can be jointly trained end-to-end. Some recent works have applied adversarial learning to sequence generation; for example, Yu et al. backpropagate the error from the discriminator to the sequence generator by using policy gradient reinforcement learning. This model shows outstanding performance on several sequence generation problems, such as speech generation and poem generation. The work is further extended to more tasks such as image captioning and dialog generation. Our work is also inspired by the success of adversarial learning, but we carefully extend it according to our application, i.e. Visual Dialog. Specifically, we redesign the generator and discriminator in order to accept multi-modal information (visual content and dialog history). We also apply an intermediate reward for each generation step in the generator; more details can be found in Sec. 3.3.

Adversarial Learning for Visual Dialog Generation

In this section, we describe our adversarial learning approach to generating natural dialog responses based on an image. There are several ways of defining the visual dialog generation task. We follow the existing formulation in which an image $I$, a ‘ground truth’ dialog history (including an image description $C$) $H=(C,(Q_{1},A_{1}),...,(Q_{t-1},A_{t-1}))$ (we define each Question-Answer (QA) pair as an utterance $U_{t}$, and $U_{0}=C$), and the question $Q$ are given. The visual dialog generation model is required to return a response sentence $\hat{A}=[a_{1},a_{2},...,a_{K}]$ to the question, where $K$ is the length (number of words) of the response. As in VQA, two types of models may be used to produce the response: generative and discriminative. In a generative decoder, a word sequence generator (for example, an RNN) is trained to fit the ground truth answer word sequences. For a discriminative decoder, an additional candidate response vocabulary is provided and the problem is re-formulated as a multi-class classification problem. The biggest limitation of the discriminative style decoder is that it can produce a response only if that response exists in the fixed vocabulary. Our approach is based on a generative model, both because a fixed vocabulary undermines the general applicability of the model and because a generative model offers a better prospect of being extensible to the problem of generating more meaningful dialogue in future.

In terms of reinforcement learning, our response sentence generation process can be viewed as a sequence of prediction actions that are taken according to a policy defined by a sequential co-attention generative model. This model is critical as it allows attention (and thus reasoning) to pass across image, question, and dialogue history equally. A discriminator is trained to label whether a response is human generated or machine generated, conditioned on the image, question and dialog attention memories. Because we take the dialog and the image as a whole into account, we are actually measuring whether the generated response fits into the visual dialog. The output from this discriminative model is used as a reward to the generator, pushing it to generate responses that fit better with the dialog history. In order to consider the reward at the local (i.e. word and phrase) level, we use a Monte Carlo (MC) search strategy, and the REINFORCE algorithm is used to update the policy. An overview of our model can be found in Fig. 2. In the following sections, we introduce each component of our model separately.

We employ an encoder-decoder style generative model, which has been widely used in sequence generation problems. In contrast to the text-only dialog generation problem, which only needs to consider the dialog history, visual dialog generation additionally requires the model to understand visual information. And distinct from VQA, which has only one round of questioning, visual dialog has multiple rounds of dialog history that need to be accessed and understood. This suggests that an encoder that can combine multiple information sources is required. A naive way of doing this is to represent the inputs (image, history and question) separately and then concatenate them to learn a joint representation. We contend, however, that it is more powerful to let the model selectively focus on regions of the image and segments of the dialog history according to the question.

Based on this, we propose a sequential co-attention mechanism. Specifically, we first use a pre-trained CNN to extract the spatial image features $V=[v_{1},\dots,v_{N}]$ from the convolutional layer, where $N$ is the number of image regions. The question features are $Q=[q_{1},\dots,q_{L}]$, where $q_{l}=LSTM(w_{l},q_{l-1})$ is the hidden state of an LSTM at step $l$ given the input word $w_{l}$ of the question, and $L$ is the length of the question. Because the history $H$ is composed of a sequence of utterances, we extract each utterance feature separately to make up the dialog history features, i.e., $U=[u_{0},\dots,u_{T}]$, where $T$ is the number of rounds of utterances (QA pairs). Each $u$ is the last hidden state of an LSTM that accepts the utterance word sequence as input.

where $[;]$ is a concatenation operator. Finally, this vector representation is fed to an LSTM to compute the probability of generating each token in the target using a softmax function, which forms the response $\hat{A}$ . The whole generation process is denoted as $\pi(\hat{A}|V,U,Q)$ .
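The attend-then-concatenate idea described above can be illustrated with a minimal NumPy sketch. It assumes a generic additive attention guided by two context vectors, that the image, history and question features have already been projected to a common dimension, and an attention order (question, then image, then history, then question again) chosen purely for illustration. The function names and the random placeholder weights are hypothetical, and this is not necessarily the authors' exact formulation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def init_params(rng, d, h):
    """Random placeholder weights; in a real model these would be learned."""
    return {
        "Wx": rng.normal(0.0, 0.1, (d, h)),
        "Wg1": rng.normal(0.0, 0.1, (d, h)),
        "Wg2": rng.normal(0.0, 0.1, (d, h)),
        "w": rng.normal(0.0, 0.1, h),
    }


def attend(X, g1, g2, p):
    """Additive attention over the rows of X, guided by vectors g1 and g2."""
    H = np.tanh(X @ p["Wx"] + g1 @ p["Wg1"] + g2 @ p["Wg2"])  # shape (n, h)
    alpha = softmax(H @ p["w"])                                # shape (n,)
    return alpha @ X, alpha                                    # (d,), (n,)


def sequential_co_attention(V, U, Q, rng, h=16):
    """Attend to each source in turn, guided by the other two (assumed order)."""
    d = V.shape[1]
    zero = np.zeros(d)
    q0, _ = attend(Q, zero, zero, init_params(rng, d, h))  # question summary
    v, a_v = attend(V, q0, zero, init_params(rng, d, h))   # image given question
    u, a_u = attend(U, v, q0, init_params(rng, d, h))      # history given image, question
    q, a_q = attend(Q, v, u, init_params(rng, d, h))       # question given image, history
    joint = np.concatenate([v, u, q])                      # the [v; u; q] concatenation
    return joint, (a_v, a_u, a_q)
```

The returned attention weights are the kind of quantity that the text describes as a ‘reason’ for the response; in this sketch they are simply returned alongside the joint feature.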

3.2 A discriminative model with attention memories

3.3 Adversarial REINFORCE with an intermediate reward

where $p$ is the probability of the generated response words, $a_{k}$ is the $k$-th word in the response, and $b$ denotes the baseline value. Following previous work, we train a critic neural network to estimate the baseline value $b$ given the current state under the current generation policy $\pi$. The critic network takes the visual content, dialog history and question as input, encodes them to a vector representation with our co-attention model and maps the representation to a scalar. The critic neural network is optimised based on the mean squared loss between the estimated reward and the real reward obtained from the discriminator. The entire model can be trained end-to-end, with the discriminator updating synchronously. We use the human generated dialog history and answers as the positive examples and the machine generated responses as negative examples.
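For orientation, the textbook REINFORCE estimator with a baseline can be written with the symbols defined above. Here $\theta$ (the generator parameters) and $r(\hat{A})$ (the discriminator's score for the complete response) are assumed names introduced for illustration, and the exact form may differ from the authors' equation:

$$
\nabla_{\theta} J(\theta) \approx \sum_{k=1}^{K} \nabla_{\theta} \log p(a_{k} \mid V,U,Q,a_{1},\dots,a_{k-1})\,\bigl(r(\hat{A})-b\bigr)
$$

Under this form every token shares the single sequence-level term $r(\hat{A})-b$, and a critic trained with a mean squared loss would minimise $(b-r(\hat{A}))^{2}$.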

An issue with the above vanilla REINFORCE is that it only considers a reward value for a finished sequence, and the reward associated with this sequence is used for all actions, i.e., the generation of each token. However, in a sequence generation problem, rewards for intermediate steps are necessary. For example, given the question ‘Are they adults or babies?’, the human-generated answer is ‘I would say they are adults’, while the machine-generated answer is ‘I can’t tell’. The above REINFORCE model will give the same low reward to all the tokens of the machine-generated answer, whereas a more appropriate scheme assigns rewards to each token separately, i.e., a high reward to the token ‘I’ and low rewards to the tokens ‘can’t’ and ‘tell’.

Considering that the discriminator is only trained to assign rewards to fully generated sentences, but not intermediate ones, we propose to use the Monte Carlo (MC) search with a roll-out (generator) policy $\pi$ to sample tokens. An N-time MC search can be represented as:

where $\hat{A}_{1:k}^{n}=(a_{1},\dots,a_{k})$ and $\hat{A}_{k+1:K}^{n}$ are sampled based on the roll-out policy $\pi$ and the current state. We run the roll-out policy from the current state to the end of the sequence $N$ times, and the $N$ generated answers share a common prefix $\hat{A}_{1:k}$. These $N$ sequences are fed to the discriminator, the average score

of which is used as a reward for the action of generating the token $a_{k}$ . With this intermediate reward, our gradient is computed as:

where we can see the intermediate rewards for each generation action are considered.
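The roll-out procedure described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `rollout` and `score` are hypothetical stand-ins for the generator's roll-out policy and the discriminator, and the toy versions below use a tiny hand-picked vocabulary rather than any learned model.

```python
import random

# Toy stand-ins (hypothetical): words treated as human-like by the toy scorer.
HUMAN_LIKE = {"i", "would", "say", "they", "are", "adults"}
VOCAB = sorted(HUMAN_LIKE | {"can't", "tell"})


def toy_rollout(prefix, n_tokens, rng):
    """Complete a prefix with n_tokens words sampled uniformly from VOCAB."""
    return [rng.choice(VOCAB) for _ in range(n_tokens)]


def toy_score(sequence):
    """Fraction of tokens in HUMAN_LIKE, standing in for a discriminator score."""
    return sum(tok in HUMAN_LIKE for tok in sequence) / len(sequence)


def mc_intermediate_rewards(response, rollout, score, n_rollouts=5, seed=0):
    """Reward for each token a_k: the mean score of n_rollouts completed
    sequences that all share the prefix a_1..a_k."""
    rng = random.Random(seed)
    K = len(response)
    rewards = []
    for k in range(1, K + 1):
        prefix = response[:k]
        if k == K:
            # The full sequence needs no roll-out; score it directly.
            rewards.append(score(prefix))
            continue
        total = 0.0
        for _ in range(n_rollouts):
            completion = rollout(prefix, K - k, rng)
            total += score(prefix + completion)
        rewards.append(total / n_rollouts)
    return rewards
```

Calling `mc_intermediate_rewards(["i", "can't", "tell"], toy_rollout, toy_score)` returns one reward per token, so each generation action can be weighted by its own estimate rather than by a single sequence-level value.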

Teacher forcing

Although the reward returned from the discriminator is used to adjust the generation process, we find it is still important to feed human generated responses to the generator when updating the model. Hence, we apply a teacher forcing strategy to update the parameters of the generator. Specifically, at each training iteration, we first update the generator using the reward obtained from data sampled with the generator policy. Then we sample some data from the real dialog history and use them to update the generator with a standard maximum likelihood estimation (MLE) objective. The whole training process is summarised in Alg. 1.
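The alternation between the two generator updates can be sketched as a simple schedule. This is only an outline under assumptions: the step labels are hypothetical, one training iteration is counted as one generator-updating step, and the interval of 20 for discriminator updates is taken from the implementation details of this paper.

```python
def training_schedule(n_iterations, d_every=20):
    """Return the ordered list of update types for n_iterations of training."""
    steps = []
    for it in range(1, n_iterations + 1):
        # Policy-gradient update using the discriminator's reward.
        steps.append("G:reinforce")
        # MLE update on human-generated responses (teacher forcing).
        steps.append("G:teacher_forcing")
        # Discriminator update at a fixed interval (assumed to count iterations).
        if it % d_every == 0:
            steps.append("D:update")
    return steps
```

In a real training loop each label would be replaced by the corresponding parameter update.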

Experiments

We evaluate our model on a recently published visual dialog generation dataset, VisDial. Images in VisDial are all from MS COCO, which contains multiple objects in everyday scenes. The dialogs in VisDial are collected by pairing two AMT workers (a ‘questioner’ and an ‘answerer’) to chat with each other about an image. To make the dialog measurable, the image remains hidden from the questioner, whose task is to ask questions about this hidden image to ‘imagine the scene better’. The answerer sees the image and answers the questions asked by the questioner. Hence, the conversation is more like multiple rounds of visual question answering, and it ends only after 10 rounds. In VisDial v0.9, the latest available version thus far, there are 83k dialogs in the COCO training split and 40k in the validation split, for a total of 1,232,870 QA pairs. Following previous work, we use 80k dialogs for training, 3k for validation and 40k for testing.

Unlike previous language generation tasks, which normally use BLEU, METEOR or ROUGE scores for evaluation, we follow previous work and use a retrieval setting to evaluate the individual responses at each round of a dialog. Specifically, at test time, besides the image, ground truth dialog history and the question, a list of 100 candidate answers is also given. The model is evaluated on retrieval metrics: (1) rank of the human response, (2) existence of the human response in the top-$k$ ranked responses, i.e., recall@$k$, and (3) mean reciprocal rank (MRR) of the human response. Since we focus on evaluating the generalization ability of our generator, we simply rank the candidates by the generative model’s log-likelihood scores.
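The three retrieval metrics can be computed from per-candidate scores as in the following sketch. The function names are hypothetical, and ties are resolved optimistically (a candidate is only outranked by strictly higher scores), which is an assumption rather than something the evaluation protocol specifies.

```python
import numpy as np


def rank_of_truth(scores, gt_index):
    """1-based rank of the human response when candidates are sorted by
    descending score (for example, generator log-likelihood)."""
    scores = np.asarray(scores, dtype=float)
    return int(1 + np.sum(scores > scores[gt_index]))


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Mean rank, mean reciprocal rank and recall@k over a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {
        "mean_rank": float(ranks.mean()),
        "mrr": float((1.0 / ranks).mean()),
    }
    for k in ks:
        metrics[f"recall@{k}"] = float((ranks <= k).mean())
    return metrics
```

For each question, `rank_of_truth` would be applied to the 100 candidate scores, and `retrieval_metrics` then aggregates the resulting ranks over the test set.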

4.2 Implementation Details

To pre-process the data, we first lowercase all the texts, convert digits to words, and remove contractions, before tokenizing. The captions, questions and answers are further truncated to ensure that they are no longer than 40, 20 and 20 words, respectively. We then construct the vocabulary of words that appear at least 5 times in the training split, giving us a vocabulary of 8845 words. The words are represented as one-hot vectors, and 512-d embeddings for the words are learned. These word embeddings are shared across the question, history and decoder LSTMs. All the LSTMs in our model are 1-layered with 512 hidden units. The Adam optimizer is used with a base learning rate of $10^{-3}$, further decreasing to $10^{-5}$. We use 5-time Monte Carlo (MC) search for each token. The co-attention generative model is pre-trained using the ground-truth dialog history for 30 epochs. We also pre-train our discriminator (for 30 epochs), where the positive examples are sampled from the ground-truth dialog and the negative examples are sampled from the dialog generated by our generator. The discriminator is updated after every 20 generator-updating steps.
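A minimal sketch of the text pre-processing might look as follows. It is a simplification under assumptions: digits are converted one character at a time (so ‘10’ becomes ‘one zero’), contraction handling is omitted, the tokenizer is a simple regular expression, and the `<unk>` entry and function names are hypothetical.

```python
import re
from collections import Counter

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def tokenize(text, max_len):
    """Lowercase, spell out digits, split into tokens, truncate to max_len."""
    text = text.lower()
    # Replace each digit character with its word, padded by spaces.
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    # Words (letters and apostrophes) or single punctuation characters.
    tokens = re.findall(r"[a-z']+|[^\sa-z']", text)
    return tokens[:max_len]


def build_vocab(token_lists, min_count=5):
    """Map each word seen at least min_count times to an integer index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(["<unk>"] + words)}
```

With the limits given above, questions and answers would be tokenized with `max_len=20` and captions with `max_len=40`.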

4.3 Experiment results

We compare our model with a number of baselines and state-of-the-art models. Answer Prior is a naive baseline that encodes answer options with an LSTM and scores them with a linear classifier, which captures ranking by frequency of answers in the training set. NN finds the nearest neighbor images and questions for a test question and its related image. The options are then ranked by their mean similarity to the answers to these questions. Late Fusion (LF) encodes the image, dialog history and question separately, then concatenates them and linearly transforms the result into a joint representation. HRE applies a hierarchical recurrent encoder to encode the dialog history, and HREA additionally adds an attention mechanism over the dialogs. Memory Network (MN) maintains each previous question and answer as a ‘fact’ in its memory bank and learns to refer to the stored facts and image to answer the question. A concurrent work proposes HCIAE (History-Conditioned Image Attentive Encoder) to attend over image and dialog features.

From Table 1, we can see that our final generative model CoAtt-GAN-w/ $\mathbf{R}_{inte}$-TF performs the best on all the evaluation metrics. Compared to the previous state-of-the-art model MN, our model performs better by 3.81% on R@1. We also produce better results than the HCIAE model, which gave the previous best results without using any discriminative knowledge. Figure 4 shows some qualitative results of our model. More results can be found in the supplementary material.

Ablation study

Our model contains several components. In order to verify the contribution of each component, we evaluate several variants of our model.

CoAtt-G-MLE is the generative model that uses our co-attention mechanism shown in Sec. 3.1. This model is trained only with the MLE objective, without any adversarial learning strategies. Hence, it can be used as a baseline model for other variants.

CoAtt-GAN-w/o $\mathbf{R}_{inte}$ is an extension of the above CoAtt-G-MLE model with an adversarial learning strategy. The reward from the discriminator is used to guide the generator training, but we only use the global reward to calculate the gradient, as shown in Equ. 8.

CoAtt-GAN-w/ $\mathbf{R}_{inte}$ uses the intermediate reward as shown in Equ. 10 and 11.

CoAtt-GAN-w/ $\mathbf{R}_{inte}$-TF is our final model, which adds a ‘teacher forcing’ step after the adversarial learning.

Our baseline CoAtt-G-MLE model outperforms the previous attention-based models (HREA, MN, HCIAE), which shows that our co-attention mechanism can effectively encode the complex multi-source information. CoAtt-GAN-w/o $\mathbf{R}_{inte}$ produces slightly better results than our baseline model by using the adversarial learning network, but the improvement is limited. The intermediate reward mechanism contributes the most to the improvement, i.e., our proposed CoAtt-GAN-w/ $\mathbf{R}_{inte}$ model improves over our baseline by 1% on average. The additional teacher forcing step (our final model) brings a further improvement of 0.5% on average, achieving the best results.

Discriminative setting

We additionally implement a model for the discriminative task on the VisDial dataset. In this discriminative setting, there is no need to generate a string; instead, a pre-defined answer set is given and the problem is formulated as a classification problem. We modify our model by replacing the response generation LSTM (which can be treated as a multi-step classification process) with a single-step classifier. HCIAE-NP-ATT is the original HCIAE model with an n-pair discriminative loss and a self-attention mechanism. AMEM applies a more advanced memory network to model the dependency of the current question on previous attention. Two additional VQA models are used for comparison. Table 2 shows that our model outperforms the previous baseline and state-of-the-art models on all the evaluation metrics.

4.4 Human study

The above experiments verify the effectiveness of our proposed model on the VisDial task. In this section, to check whether our model can generate more human-like dialogs, we conduct a human study.

We randomly sample 1000 results of different lengths from the test dataset, generated by our final model, our baseline model CoAtt-G-MLE, and the Memory Network (MN) model (we use the author-provided code and pre-trained model available at https://github.com/batra-mlp-lab/visdial). We then ask 3 human subjects to guess whether the last response in the dialog is human-generated or machine-generated; if at least 2 of them agree that it is generated by a human, we say it has passed the Turing Test. Table 3 summarizes the percentage of responses that pass the Turing Test (M1); our model outperforms both the baseline model and the MN model. We also apply our discriminator model from Sec. 3.2 to these 1000 samples, and it recognizes nearly 70% of them as human-generated responses (random guess is 50%), which suggests that our final generator successfully fools the discriminator in this adversarial learning. We additionally record the percentage of responses that are evaluated as better than or equal to human responses (M2), according to the human subjects’ manual evaluation. As shown in Table 3, 45% of the responses fall into this case.
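The majority-vote criterion behind M1 can be written down directly. The sketch below assumes each response comes with one boolean vote per judge (True meaning ‘judged human-generated’); the function name and data layout are hypothetical.

```python
def turing_pass_rate(votes, min_agree=2):
    """Fraction of responses judged human-generated by at least min_agree judges.

    votes: list of per-response tuples of booleans, one boolean per judge.
    """
    passed = sum(1 for v in votes if sum(v) >= min_agree)
    return passed / len(votes)
```

For example, with votes `[(True, True, False), (False, False, True)]` only the first response meets the two-of-three threshold.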

Conclusion

Visual Dialog generation is an interesting topic that requires a machine to understand visual content and natural language dialog, and to have the ability to perform multi-modal reasoning. More importantly, as a human-computer interaction interface for future robotics and AI, the human-likeness of the generated response is, apart from its correctness, a significant index. In this paper, we have proposed an adversarial learning based approach to encourage the generator to generate more human-like dialogs. Technically, by combining a sequential co-attention generative model that can reason jointly over the image, dialog history and question, and a discriminator that can dynamically access the attention memories, with an intermediate reward, our final proposed model achieves the state of the art on the VisDial dataset. A Turing Test style study also shows that our model can produce more human-like visual dialog responses.