End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh

cs.CL cs.CV cs.SD eess.AS

Introduction

Spoken dialog technologies have been applied in real-world human-machine interfaces including smartphone digital assistants, car navigation systems, voice-controlled smart speakers, and human-facing robots. Generally, a dialog system consists of a pipeline of data processing modules, including automatic speech recognition, spoken language understanding, dialog management, sentence generation, and speech synthesis. However, all of these modules require significant hand engineering and domain knowledge for training. Recently, end-to-end dialog systems have been gathering attention, and they obviate this need for expensive hand engineering to some extent. In end-to-end approaches, dialog models are trained using only paired input and output sentences, without relying on pre-designed data processing modules or intermediate internal data representations such as concept tags and slot-value pairs. End-to-end systems can be trained to directly map from a user’s utterance to a system response sentence and/or action. This significantly reduces the data preparation and system development cost. Several types of sequence-to-sequence models have been applied to end-to-end dialog systems, and it has been shown that they can be trained in a completely data-driven manner. End-to-end approaches have also been shown to better handle flexible conversations between the user and the system by training the model on large conversational datasets.

In these applications, however, all conversation is triggered by user speech input, and the contents of system responses are limited by the training data (a set of dialogs). Current dialog systems cannot understand dynamic scenes using multimodal sensor-based input such as vision and non-speech audio, so machines using such dialog systems cannot have a conversation about what’s going on in their surroundings. To develop machines that can carry on a conversation about objects and events taking place around the machines or the users, dynamic scene-aware dialog technology is essential.

To interact with humans about visual information, systems need to understand both visual scenes and natural language inputs. One naive approach could be a pipeline system in which the output of a visual description system is used as an input to a dialog system. In this cascaded approach, semantic frames such as "who" is doing "what" and "where" must be extracted from the video description results. The prediction of frame type and the value of the frame must be trained using annotated data. In contrast, the recent revolution of neural network models allows us to combine different modules into a single end-to-end differentiable network. We can simultaneously input video features and user utterances into an encoder-decoder-based system whose outputs are natural-language responses.

Using this end-to-end framework, visual question answering (VQA) has been intensively researched in the field of computer vision. The goal of VQA is to generate answers to questions about an imaged scene, using the information present in a single static image. As a further step towards conversational visual AI, the new task of visual dialog was introduced, in which an AI agent holds a meaningful dialog with humans about an image using natural, conversational language. While VQA and visual dialog take significant steps towards human-machine interaction, they only consider a single static image. To capture the semantics of dynamic scenes, recent research has focused on video description (natural-language descriptions of videos). The state of the art in video description uses a multimodal attention mechanism that selectively attends to different input modalities (feature types), such as spatiotemporal motion features and audio features, in addition to temporal attention.

Audio Visual Scene-Aware Dialog Dataset

We collected text-based conversation data about short videos for Audio Visual Scene-Aware Dialog (AVSD) for the 7th edition of the Dialog System Technology Challenge (DSTC7, http://workshop.colips.org/dstc7/call.html), using videos from an existing video description dataset, Charades. Charades is an untrimmed, multi-action dataset containing 11,848 videos, split into 7,985 for training, 1,863 for validation, and 2,000 for testing. It has 157 action categories, with several fine-grained actions. This dataset also provides 27,847 textual descriptions for the videos, with each video associated with 1–3 sentences. As these textual descriptions are only available in the training and validation sets, we report evaluation results on the validation set.

The data collection paradigm for dialogs was similar to the one used for visual dialog, in which for each image, two different Mechanical Turk workers interacted via a text interface to yield a dialog. In that work, each dialog consisted of a sequence of questions and answers about an image. In the video scene-aware dialog case, two Amazon Mechanical Turk (AMT) workers had a discussion about events in a video. One of the workers played the role of an answerer who had already watched the video. The answerer answered questions asked by another AMT worker, the questioner. The questioner was not allowed to watch the whole video but only its first, middle, and last frames, which were single static images. After asking about the events that happened between the frames over 10 rounds of QA, the questioner summarized the events in the video as a description.

In total, we collected dialogs for 7,043 videos from the Charades training set and all of the validation set (1,863 videos). Since we did not have scripts for the test set, we split the validation set into 732 and 733 videos and used them as our validation and test sets, respectively. See Table 1 for statistics. The average numbers of words per question and answer are 8 and 10, respectively.

Video Scene-aware Dialog System

We built an end-to-end dialog system that can generate answers in response to user questions about events in a video sequence. Our architecture is similar to the Hierarchical Recurrent Encoder in Das et al. The question, visual features, and the dialog history are fed into corresponding LSTM-based encoders to build up a context embedding, and then the outputs of the encoders are fed into an LSTM-based decoder to generate an answer. The history consists of encodings of QA pairs. We feed multimodal attention-based video features into the LSTM encoder instead of single static image features. Figure 1 shows the architecture of our video scene-aware dialog system.

This section explains the neural conversation model, which is designed as a sequence-to-sequence mapping process using recurrent neural networks (RNNs). Let $X$ and $Y$ be input and output sequences, respectively. The model is used to compute the posterior probability distribution $P(Y|X)$. For conversation modeling, $X$ corresponds to the sequence of previous sentences in a conversation, and $Y$ is the system response sentence we want to generate. In our model, both $X$ and $Y$ are sequences of words. $X$ contains all of the previous turns of the conversation, concatenated in sequence, separated by markers that indicate to the model not only that a new turn has started, but which speaker said that sentence. The most likely hypothesis of $Y$ is obtained as

$$\hat{Y}=\arg\max_{Y\in{\cal V}^{*}}P(Y|X) \tag{1}$$

$$\phantom{\hat{Y}}=\arg\max_{Y\in{\cal V}^{*}}\prod_{m=1}^{|Y|}P(y_{m}|y_{1},\dots,y_{m-1},X), \tag{2}$$

where ${\cal V}^{*}$ denotes the set of sequences of zero or more words in the system vocabulary $\cal V$.

Let $X$ be the word sequence $x_{1},\dots,x_{T}$ and $Y$ be the word sequence $y_{1},\dots,y_{M}$. The encoder network is used to obtain hidden states $h_{t}$ for $t=1,\dots,T$ as:

$$h_{t}=\mbox{LSTM}(x_{t},h_{t-1};\theta_{enc}), \tag{3}$$

where $h_{0}$ is initialized with a zero vector and $\mbox{LSTM}(\cdot)$ is an LSTM function with parameter set $\theta_{enc}$.

The decoder network is used to compute probabilities $P(y_{m}|y_{1},\dots,y_{m-1},X)$ for $m=1,\dots,M$ as:

$$s_{0}=h_{T} \tag{4}$$

$$s_{m}=\mbox{LSTM}(y_{m-1},s_{m-1};\theta_{dec}) \tag{5}$$

$$P(y_{m}|y_{1},\dots,y_{m-1},X)=\mbox{softmax}(W_{o}s_{m}+b_{o}), \tag{6}$$

where $y_{0}$ is set to a special symbol representing the end of sequence, $s_{m}$ is the $m$-th decoder state, $\theta_{dec}$ is a set of decoder parameters, and $W_{o}$ and $b_{o}$ are a matrix and a vector. In this model, the initial decoder state $s_{0}$ is given by the final encoder state $h_{T}$ as in Eq. (4), and the probability is estimated from each state $s_{m}$. To efficiently find $\hat{Y}$ in Eq. (1), we use a beam search technique, since it is computationally intractable to consider all possible $Y$.
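As an illustration of the beam search mentioned above, the following is a minimal Python sketch rather than the implementation used in our experiments. The function `toy_model`, its probability table, and the `<eos>` label are hypothetical stand-ins for the decoder softmax and the end-of-sequence symbol.

```python
import math


def beam_search(step_log_probs, beam_size=3, max_len=10, eos="<eos>"):
    """Approximate argmax_Y P(Y|X) by keeping only the best partial hypotheses.

    step_log_probs(prefix) must return a dict mapping each candidate next word
    to log P(word | prefix, X).
    """
    beams = [(0.0, [])]  # (cumulative log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, logp in step_log_probs(tuple(words)).items():
                hyp = (score + logp, words + [word])
                # Hypotheses that emit the end symbol are set aside as complete.
                (finished if word == eos else candidates).append(hyp)
        if not candidates:
            break
        candidates.sort(key=lambda h: h[0], reverse=True)
        beams = candidates[:beam_size]
    return max(finished or beams, key=lambda h: h[0])


def toy_model(prefix):
    """Hypothetical next-word distribution standing in for the decoder."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"man": 0.5, "dog": 0.5},
        ("the",): {"man": 0.9, "dog": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {word: math.log(p) for word, p in probs.items()}


best_score, best_words = beam_search(toy_model, beam_size=2)
print(best_words)  # ['the', 'man', '<eos>']
```

In this toy case a greedy search would commit to "a" at the first step and end with probability 0.30, whereas a beam of size 2 keeps "the" alive and finds the hypothesis with probability 0.36.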

In the scene-aware-dialog scenario, a scene context vector including audio and visual features is also fed to the decoder. We modify the LSTM in Eqs. (4)–(6) as

$$s_{n,0}=\mathbf{0} \tag{7}$$

$$s_{n,m}=\mbox{LSTM}\left(\left[y_{n,m-1}^{\top},g_{n}^{\top}\right]^{\top},s_{n,m-1};\theta_{dec}\right) \tag{8}$$

$$P(y_{n,m}|y_{n,1},\dots,y_{n,m-1},g_{n})=\mbox{softmax}(W_{o}s_{n,m}+b_{o}), \tag{9}$$

where $g_{n}$ is the concatenation of question encoding $g_{n}^{(q)}$, audio-visual encoding $g_{n}^{(av)}$, and history encoding $g_{n}^{(h)}$ for generating the $n$-th answer $A_{n}=y_{n,1},\dots,y_{n,|Y_{n}|}$. Note that unlike Eq. (4), we feed all contextual information to the LSTM at every prediction step. This architecture is more flexible since the dimensions of encoder and decoder states can be different.

$g_{n}^{(q)}$ is encoded by another LSTM for the $n$ -th question, and $g_{n}^{(h)}$ is encoded with hierarchical LSTMs, where one LSTM encodes each question-answer pair and then the other LSTM summarizes the question-answer encodings into $g_{n}^{(h)}$ . The audio-visual encoding is obtained by multi-modal attention described in the next section.
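To make the role of the context vector concrete, the sketch below shows one way the decoder inputs could be assembled: the three encodings are concatenated into $g_{n}$, and the same $g_{n}$ is appended to the previous-word embedding at every prediction step. The function names and the toy dimensions are assumptions for illustration only.

```python
def build_context(g_q, g_av, g_h):
    """Concatenate question, audio-visual, and history encodings into g_n."""
    return list(g_q) + list(g_av) + list(g_h)


def decoder_inputs(prev_word_embeddings, g_n):
    """Append the same context vector g_n to every previous-word embedding."""
    return [list(emb) + g_n for emb in prev_word_embeddings]


# Toy sizes (hypothetical): 2-dim question, 3-dim audio-visual, 2-dim history.
g_n = build_context([0.1, 0.2], [0.3, 0.4, 0.5], [0.6, 0.7])
steps = decoder_inputs([[1.0, 0.0], [0.0, 1.0]], g_n)
print(len(g_n), [len(x) for x in steps])  # 7 [9, 9]
```

Because the context enters through the input rather than through the initial state, the decoder state size does not need to match any encoder state size.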

Multimodal Attention-Based Video Features

To predict a word sequence in video description, prior work extracted content vectors from image features of VGG-16 and spatiotemporal motion features of C3D, and combined them into one vector in the fusion layer as:

$$g_{n}^{(av)}=\tanh\left(\sum_{k=1}^{K}d_{k,n}\right),$$

where

$$d_{k,n}=W_{ck}c_{k,n}+b_{ck},$$

$b_{ck}$ is a bias vector, and $c_{k,n}$ is a context vector obtained using the $k$-th input modality.

We call this approach Naïve Fusion, in which multimodal feature vectors are combined using projection matrices $W_{ck}$ for $K$ different modalities (input sequences $x_{k1},\dots,x_{kL}$ for $k=1,\dots,K$).

To fuse multimodal information, prior work proposed a method that extends the attention mechanism. We call this fusion approach multimodal attention. The approach can pay attention to specific modalities of input based on the current state of the decoder to predict the word sequence in video description. The number of modalities, i.e., the number of sequences of input feature vectors, is denoted by $K$.

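The following equation shows an approach to perform the attention-based feature fusion:

$$g_{n}^{(av)}=\tanh\left(\sum_{k=1}^{K}\beta_{k,n}d_{k,n}\right),$$

where $d_{k,n}$ is the projected context vector of the $k$-th modality, as in Naïve Fusion. A mechanism similar to temporal attention is applied to obtain the multimodal attention weights ${\beta}_{k,n}$:

$$\beta_{k,n}=\frac{\exp(v_{k,n})}{\sum_{\kappa=1}^{K}\exp(v_{\kappa,n})},$$

where

$$v_{k,n}=w_{B}^{\top}\tanh\left(W_{B}g_{n}^{(q)}+V_{Bk}c_{k,n}+b_{Bk}\right).$$

Here the multimodal attention weights are determined by the question encoding $g_{n}^{(q)}$ and the context vector of each modality $c_{k,n}$, as well as the temporal attention weights in each modality. $W_{B}$ and $V_{Bk}$ are matrices, $w_{B}$ and $b_{Bk}$ are vectors, and $v_{k,n}$ is a scalar. The multimodal attention weights can change according to the question encoding and the feature vectors (shown in Figure 1).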

This enables the decoder network to attend to a different set of features and/or modalities when predicting each subsequent word in the description. Naïve fusion can be considered a special case of Attentional fusion, in which all modality attention weights, $\beta_{k,n}$ , are constantly 1.
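The fusion step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our implementation: it assumes a tanh scoring layer followed by a softmax over modalities, omits the bias in the context projection, and uses hypothetical identity matrices and toy vectors as parameters.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_fusion(g_q, contexts, W_B, V_B, b_B, w_B, W_c):
    """Return modality weights beta and the fused audio-visual encoding."""
    # Score each modality from the question encoding and its context vector.
    q_proj = matvec(W_B, g_q)
    scores = []
    for k, c_k in enumerate(contexts):
        c_proj = matvec(V_B[k], c_k)
        hidden = [math.tanh(q + c + b) for q, c, b in zip(q_proj, c_proj, b_B[k])]
        scores.append(sum(w * h for w, h in zip(w_B, hidden)))
    beta = softmax(scores)
    # Weighted sum of the projected context vectors, followed by tanh.
    fused = [0.0] * len(W_c[0])
    for k, c_k in enumerate(contexts):
        d_k = matvec(W_c[k], c_k)
        fused = [f + beta[k] * d for f, d in zip(fused, d_k)]
    return beta, [math.tanh(f) for f in fused]


identity = [[1.0, 0.0], [0.0, 1.0]]
beta, g_av = attentional_fusion(
    g_q=[1.0, 0.0],
    contexts=[[1.0, 2.0], [0.5, -1.0]],  # one context vector per modality
    W_B=identity, V_B=[identity, identity],
    b_B=[[0.0, 0.0], [0.0, 0.0]], w_B=[1.0, 1.0],
    W_c=[identity, identity],
)
print(beta, g_av)
```

Replacing `beta` with a list of ones in the weighted sum gives the Naïve Fusion case described above.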

Experiments for Multimodal attention-based Video Features

To select the best video features for the video scene-aware dialog system, we first evaluate the performance of video description using multimodal attention-based video features.

We evaluated our proposed feature fusion using the MSVD (YouTube2Text), MSR-VTT, and Charades video data sets.

MSVD (YouTube2Text) covers a wide range of topics including sports, animals, and music. We applied the same condition as in prior work: a training set of 1,200 video clips, a validation set of 100 clips, and a test set of the remaining 670 clips.

MSR-VTT is split into training, validation, and testing sets of 6,513, 497, and 2,990 clips, respectively. However, approximately 12% of the MSR-VTT videos on YouTube have been removed. We used the available data, which consist of 5,763, 419, and 2,616 clips for training, validation, and testing, respectively.

Charades is split into 7,985 clips for training and 1,863 clips for validation. It provides 27,847 textual descriptions for the videos. As these textual descriptions are only available in the training and validation sets, we report the evaluation results on the validation set.

Details of textual descriptions are summarized in Table 2.

Video Processing

We used a sequence of 4096-dimensional feature vectors of the output from the fully-connected fc7 layer of a VGG-16 network pretrained on the ImageNet dataset for the image features.

The pretrained C3D model is used to generate features that model motion and short-term spatiotemporal activity. The C3D network reads sequential frames in the video and outputs a fixed-length feature vector every 16 frames. The 4096-dimensional activation vectors from the fully-connected fc6-1 layer were used as the spatiotemporal features.

In addition to the VGG-16 and C3D features, we also adopted the state-of-the-art I3D features, spatiotemporal features that were developed for action recognition. The I3D model inflates the 2D filters and pooling kernels in the Inception V3 network along their temporal dimension, building 3D spatiotemporal ones. We used the output of the "Mixed_5c" layer of the I3D network as video features in our framework. As a pre-processing step, we normalized all the video features to have zero mean and unit norm; the mean was computed over all the sequences in the training set for the respective feature.
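The normalization step can be sketched as follows. This is an illustrative reading, assuming that the mean is a single per-dimension vector pooled over every frame of every training sequence and that "unit norm" means unit L2 norm per feature vector; the helper names and toy data are hypothetical.

```python
import math


def training_mean(sequences):
    """Per-dimension mean over all feature vectors of all training sequences."""
    dim = len(sequences[0][0])
    total = [0.0] * dim
    count = 0
    for seq in sequences:
        for vec in seq:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]


def normalize(vec, mean, eps=1e-12):
    """Subtract the training mean, then scale the result to unit L2 norm."""
    centered = [v - m for v, m in zip(vec, mean)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / max(norm, eps) for c in centered]


# Two toy training sequences of 2-dimensional features.
train = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
mean = training_mean(train)
feature = normalize([5.0, 6.0], mean)
print(mean)  # [3.0, 4.0]
```

The same training-set mean would then be reused when normalizing validation and test features.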

In the experiments in this paper, we treated I3D-rgb (I3D features computed on a stack of 16 video frame images) and I3D-flow (I3D features computed on a stack of 16 frames of optical flow fields) as two separate modalities that are input to our multimodal attention model. To emphasize this, we refer to I3D in the results tables as I3D (rgb-flow).

Audio Processing

While the original MSVD (YouTube2Text) dataset does not contain audio features, we were able to collect audio data for 1,649 video clips (84% of the dataset) from the video URLs. In our previous work on multimodal attention for video description, we used two different types of audio features: concatenated mel-frequency cepstral coefficient (MFCC) features, and SoundNet features. In this paper, we also evaluate features extracted using a new state-of-the-art model, Audio Set VGGish.

Inspired by the VGG image classification architecture (Configuration A without the last group of convolutional/pooling layers), the Audio Set VGGish model operates on 0.96 s log Mel spectrogram patches extracted from 16 kHz audio, and outputs a 128-dimensional embedding vector. The model was trained to predict an ontology of labels from only the audio tracks of millions of YouTube videos. In this work, we overlap frames of input to the VGGish network by 50%, meaning an Audio Set VGGish feature vector is output every 0.48 s. For SoundNet, in which a fully convolutional architecture was trained to predict scenes and objects using a pretrained image model as a teacher, we take as input to the audio encoder the output of the second-to-last convolutional layer, which gives a 1024-dimensional feature vector every 0.67 s, and has a receptive field of approximately 4.16 s. For raw MFCC features, sequences of 13-dimensional MFCC features are extracted from 50 ms windows, every 25 ms, and then 20 consecutive frames are concatenated into a 260-dimensional vector, normalized to zero mean/unit variance (computed over the training set), and used as input to the BLSTM audio encoder.
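The frame arithmetic above can be checked with a short sketch. It assumes that the 20-frame MFCC groups do not overlap and that an incomplete final group is dropped, neither of which is specified in the text; the function names are hypothetical, and times are kept in integer milliseconds to avoid floating-point rounding.

```python
def stack_frames(frames, group=20):
    """Concatenate each run of `group` consecutive frames into one long vector."""
    usable = len(frames) - len(frames) % group  # drop an incomplete final group
    return [
        [value for frame in frames[start:start + group] for value in frame]
        for start in range(0, usable, group)
    ]


def patch_starts_ms(duration_ms, patch_ms=960, overlap_percent=50):
    """Start times of fixed-length patches that overlap by the given percentage."""
    hop_ms = patch_ms * (100 - overlap_percent) // 100
    if duration_ms < patch_ms:
        return []
    return list(range(0, duration_ms - patch_ms + 1, hop_ms))


mfcc = [[float(t)] * 13 for t in range(45)]  # 45 toy frames of 13 coefficients
stacked = stack_frames(mfcc)
print(len(stacked), len(stacked[0]))         # 2 260
print(patch_starts_ms(2400))                 # [0, 480, 960, 1440]
```

With 0.96 s patches and 50% overlap, consecutive patches start 0.48 s apart, which matches the VGGish output rate stated above.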

Experimental Setup

The caption generation model, i.e., the decoder network, is trained to minimize the cross entropy criterion using the training set. Image features and deep audio features (SoundNet and VGGish) are fed to the decoder network through one projection layer of 512 units, while MFCC audio features are fed to a BLSTM encoder (one projection layer of 512 units and bidirectional LSTM layers of 512 cells) followed by the decoder network. The decoder network has one LSTM layer with 512 cells. Each word is embedded to a 256-dimensional vector when it is fed to the LSTM layer. In this video description task, we used L2 regularization for all experimental conditions and used RMSprop optimization.

Evaluation

The quality of the automatically generated sentences is evaluated with objective measures of the similarity between the generated sentences and ground truth sentences. We use the evaluation code for MS COCO caption generation (https://github.com/tylin/coco-caption) for objective evaluation of system outputs; it is a publicly available tool supporting various automated metrics for natural language generation such as BLEU, METEOR, ROUGE_L, and CIDEr.
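To give a sense of what these metrics measure, the sketch below computes clipped n-gram precision, the quantity at the core of BLEU. It is a simplified illustration and not the MS COCO toolkit's code: it scores a single sentence, splits on whitespace, and omits the brevity penalty and the averaging over n-gram orders. The example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipped counts."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


refs = ["the man is here"]
p1 = clipped_precision("the the the man", refs, 1)
p2 = clipped_precision("the the the man", refs, 2)
print(p1, p2)  # 0.5 0.3333333333333333
```

Clipping prevents a candidate from being rewarded for repeating a reference word more often than it occurs in any reference.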

Results and Discussion

Tables 3, 4, and 5 show the evaluation results on the MSVD (YouTube2Text), MSR-VTT Subset, and Charades datasets. The I3D spatiotemporal features outperformed the combination of VGG-16 image features and C3D spatiotemporal features. We also tried a combination of VGG-16 image features plus I3D spatiotemporal features, but we do not report those results because they did not improve performance over I3D features alone. We believe this is because I3D features already include enough image information for the video description task. In comparison to C3D, which uses the VGG-16 base architecture and was trained on the Sports-1M dataset, I3D uses a more powerful Inception-V3 network architecture and was trained on the larger (and cleaner) Kinetics dataset. As a result, I3D has demonstrated state-of-the-art performance for the task of human action recognition in video sequences. Further, the Inception-V3 architecture has significantly fewer network parameters than the VGG-16 network, making it more efficient.

In terms of audio features, the Audio Set VGGish model provided the best performance. While we expected the deep features (SoundNet and VGGish) to provide improved performance compared to MFCC, there are several possibilities as to why VGGish performed better than SoundNet. First, the VGGish model was trained on more data, and had audio specific labels, whereas SoundNet used pre-trained image classification networks to provide labels for training the audio network. Second, the large Audio Set ontology used to train VGGish likely provides the ability to learn features more relevant to text descriptions than the broad scene/object labels used by SoundNet.

Since it is intractable to enumerate all possible word sequences in vocabulary $\cal V$ , we usually limit them to the $n$ -best hypotheses generated by the system. Although in theory the distribution $P(Y^{\prime}|X)$ should be the true distribution, we instead estimate it using the encoder-decoder model.

Experiments for Video-scene-aware Dialog

In this paper, we extended an end-to-end dialog system to scene-aware dialog with multimodal fusion. As shown in Fig. 1, we embed the video and audio features selected in Section 2.

We evaluated our proposed system with the dialog data for Charades that we collected. Table 1 shows the size of each data set. We compared the performance of models trained on various combinations of the QA text, visual, and audio features. In addition, we tested the efficacy of the multimodal attention mechanism for dialog response generation. We employed the Adam optimizer with the cross-entropy criterion and iterated the training process for up to 20 epochs. For each of the encoder-decoder model types, we selected the model with the lowest perplexity on the expanded development set.
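The model selection rule can be illustrated with a short sketch. It assumes perplexity is computed as the exponential of the negative mean per-token natural-log probability on the development set; the epoch numbers and probabilities below are made up for the example.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_best(dev_log_probs_by_epoch):
    """Return the epoch whose model has the lowest development-set perplexity."""
    return min(
        dev_log_probs_by_epoch,
        key=lambda epoch: perplexity(dev_log_probs_by_epoch[epoch]),
    )


# Hypothetical per-token log-probabilities from two saved epochs.
dev = {
    1: [math.log(0.1), math.log(0.2)],
    2: [math.log(0.5), math.log(0.25)],
}
best_epoch = select_best(dev)
print(best_epoch)  # 2
```

A model that assigns probability 0.25 to every token has a perplexity of 4, which is a convenient sanity check for the formula.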

We used LSTMs with 2 layers and 128 cells for encoding the history and question sentences. Video features were projected to a 256-dimensional feature space before modality fusion. The decoder LSTM also had 2 layers and 128 cells.

Evaluation Results

Table 6 shows the response sentence generation performance of our models, training and decoding methods using objective measures, BLEU1-4, METEOR, ROUGE_L, and CIDEr, which were computed with the evaluation code for MS COCO caption generation as done for video description. We investigated different input features including question-answering dialog history plus last question (QA), human-annotated captions (Captions), video features of VGG16 or I3D rgb and flow features (I3D), and audio features (VGGish).

First we evaluated response generation quality with only QA features as a baseline without any video scene features. Then, we added the caption features to QA, and the performance improved significantly. This is because each caption provided the scene information in natural language and helped the system answer the question correctly. However, such human annotations are not available for real systems.

Next we added VGG16 features to QA, but they did not increase the evaluation scores from those of QA-only features. This result indicates that QA+VGG16 is not enough to let the system generate better responses than those of QA+Captions. After that, we replaced VGG16 with I3D, and obtained a certain improvement from the QA-only case. As in the video description, it has been shown that the I3D features are also useful for scene-aware dialog. Furthermore, we applied the multi-modal attention mechanism (attentional fusion) for I3D rgb and flow features, and obtained further improvement in all the metrics.

Finally, we examined the efficacy of audio features. The table shows that VGGish clearly contributed to increasing the response quality, especially when using the attentional fusion. Comparing system responses obtained with and without VGGish features, the model with VGGish features worked better for questions regarding audio.

Conclusion

In this paper, we propose a new research target, a dialog system that can discuss dynamic scenes with humans, which lies at the intersection of multiple avenues of research in natural language processing, computer vision, and audio processing. To advance this goal, we introduce a new model that incorporates technologies for multimodal attention-based video description into an end-to-end dialog system. We also introduce a new dataset of human dialogues about videos. Using this new dataset, we trained an end-to-end conversation model that generates system responses in a dialog about an input video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes. We are making our data set and model publicly available for a new Video Scene-Aware Dialog challenge.