End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering

Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim

Introduction

Video-to-language tasks, including video captioning and video question answering (QA), are recently emerging challenges in computer vision research. This set of problems is interesting as one of the frontiers in artificial intelligence; beyond that, it can also enable multiple practical applications, such as retrieving video content by users’ free-form queries or helping visually impaired people understand the visual content. Recently, a number of large-scale datasets have been introduced as a common ground for researchers to promote the progress of video-to-language research.

The objective of this work is to propose a concept word detector, as shown in Fig.1, which takes a training set of videos and associated sentences as input, and generates a list of high-level concept words per video as useful semantic priors for a variety of video-to-language tasks, including video captioning, retrieval, and question answering. We design our word detector to have the following two characteristics, so that it can be easily integrated with any video-to-language model. First, it does not require any external knowledge sources for training. Instead, our detector learns the correlation between words in the captions and video regions from the whole training data. To this end, we use a continuous soft attention mechanism that traces consistent visual information across frames and associates it with concept words from captions. Second, the word detector is trainable in an end-to-end manner jointly with any video-to-language model. The loss function for learning the word detector can be plugged as an auxiliary term into the model’s overall cost function; as a result, we reduce the effort needed to separately collect training examples and learn both models.

We also develop language model components to effectively exploit the detected words. Inspired by semantic attention in image captioning research, we develop an attention mechanism that selectively focuses on the detected concept words and fuses them with word encoding and decoding in the language model. That is, the detected concept words are combined with input words to better represent the hidden states of encoders, and with output words to generate more accurate word predictions.

In order to demonstrate that the proposed word detector and attention mechanism indeed improve the performance of multiple video-to-language tasks, we participate in four tasks of LSMDC 2016 (Large Scale Movie Description Challenge), which is one of the most active and successful benchmarks advancing video-to-language research. The challenges include movie description and multiple-choice test as video captioning, fill-in-the-blank as video question answering, and movie retrieval as video retrieval. Following the public evaluation protocol of LSMDC 2016, our approach achieves the best accuracies in three tasks (fill-in-the-blank, multiple-choice test, and movie retrieval), and comparable performance in the remaining task (movie description).

Our work can be uniquely positioned in the context of two recent research directions in image/video captioning.

Image/Video Captioning with Word Detection. Image and video captioning has been actively studied in recent vision and language research. Among these studies, there have been several attempts to detect a set of concept words or attributes from visual input to boost captioning performance. In image captioning research, Fang et al. exploit a multiple instance learning (MIL) approach to train visual detectors that identify a set of words with bounding-box regions of the image. Based on the detected words, they retrieve and re-rank the best caption sentence for the image. Wu et al. use a CNN to learn a mapping between an image and semantic attributes. They then exploit the mapping as an input to the captioning decoder. They also extend the framework to explicitly leverage external knowledge bases such as DBpedia for question answering tasks. Venugopalan et al. generate descriptions with novel words beyond the ones in the training set, by leveraging external sources, including object recognition datasets like ImageNet and external text corpora like Wikipedia. You et al. also exploit weak labels and tags on Internet images to train additional parametric visual classifiers for image captioning.

In the video domain, it is more ambiguous to learn the relation between descriptive words and visual patterns, and there has been only a little work on word detection for video captioning. Rohrbach et al. propose a two-step approach for video captioning on the LSMDC dataset. They first extract verbs, objects, and places from movie descriptions, and separately train SVM-based classifiers for each group. They then learn an LSTM decoder that generates text descriptions based on the responses of these visual classifiers.

While almost all previous captioning methods exploit external classifiers for concept or attribute detection, the novelty of our work lies in that we use only captioning training data with no external sources to learn the word detector, and propose an end-to-end design for learning both word detection and caption generation simultaneously. Moreover, compared to prior video captioning work that addresses only the movie description task of LSMDC, this work is more comprehensive in that we validate the usefulness of our method for all four tasks of LSMDC.

Attention for Captioning. Attention mechanisms have been successfully applied to caption generation. One of the earliest works dynamically focuses on different image regions to produce an output word sequence. Later, this soft attention has been extended to temporal attention over video frames for video captioning.

Beyond attention on the spatial or temporal structure of visual input, You et al. recently propose an attention on attribute words for image captioning. That is, the method enumerates a set of important object labels in the image, and then dynamically switches attention among these concept labels. Although our approach also exploits the idea of semantic attention, it bears two key differences. First, we extend semantic attention to the video domain for the first time, not only for video captioning but also for retrieval and question answering tasks. Second, the approach of You et al. relies on classifiers that are separately learned from external datasets, whereas our approach is learnable end-to-end with only the captioning training data. This significantly reduces the effort needed to prepare additional multi-label classifiers.

1.2 Contributions

We summarize the contributions of this work as follows.

(1) We propose a novel end-to-end learning approach for detecting a list of concept words and attending to them to enhance the performance of multiple video-to-language tasks. The proposed concept word detection and attention model can be plugged into any model for video captioning, retrieval, and question answering. Our technical novelties can be seen from two recent trends of image/video captioning research. First, our work is the first end-to-end trainable model not only for concept word detection but also for language generation. Second, our work is the first semantic attention model for video-to-language tasks.

(2) To validate the applicability of the proposed approach, we participate in all the four tasks of LSMDC 2016. Our models have won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval. We also attain comparable performance for movie description.

Detection of Concept Words from Videos

We first explain the pre-processing steps for representation of words and video frames. Then, we explain how we detect concept words for a given video.

Video Representation. We first equidistantly sample one frame out of every ten from a video, to reduce frame redundancy while minimizing loss of information. We denote the number of sampled video frames by $N$. We limit the maximum number of frames to $N_{max}=40$; if a video is too long, we use a wider interval for uniform sampling.
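The sampling rule above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_frame_indices`, the choice to start at frame 0, and the use of a ceiling when widening the interval are assumptions made for the example.

```python
import math


def sample_frame_indices(total_frames, stride=10, n_max=40):
    """Return indices of uniformly sampled frames, at most n_max of them."""
    if total_frames <= 0:
        return []
    # Default policy: keep one frame out of every `stride` frames.
    if math.ceil(total_frames / stride) > n_max:
        # Video too long: widen the interval so that at most n_max frames remain.
        # (Rounding the interval up is an assumption of this sketch.)
        stride = math.ceil(total_frames / n_max)
    return list(range(0, total_frames, stride))


short_clip = sample_frame_indices(95)    # interval 10 is kept
long_clip = sample_frame_indices(1000)   # interval widened to 25
```

With these settings a 95-frame clip keeps the default interval, while a 1,000-frame clip is sampled with a wider interval so that $N \le N_{max}$.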

2.2 An Attention Model for Concept Detection

Concept Words and Traces. We propose a concept word detector using LSTM networks with a soft attention mechanism. Its structure is shown in the red box of Fig.2. Its goal is, for a given video, to discover a list of concept words that consistently appear across frame regions. The detected concept words are used as additional references for video captioning models (section 3.1), which generate output sentences by selectively attending to those words.

We first define a set of $V$ candidate words from all training captions. Among them, we discover $K$ concept words per video. We set $V=2,000$ and $K=10$. To build the candidate set, we apply the automatic POS tagging of NLTK to extract nouns, verbs, and adjectives from all training caption sentences. We then compute the frequencies of those words in the training set, and select the $V$ most common words as concept word candidates.
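As a rough sketch of this candidate selection, the snippet below counts nouns, verbs, and adjectives in already-tagged captions and keeps the most frequent ones. It assumes each caption is a list of (word, tag) pairs with Penn Treebank tags, which is the format returned by NLTK's `pos_tag`; the function name `select_concept_candidates` and the lowercasing step are assumptions of this example, not details given in the text.

```python
from collections import Counter

# Penn Treebank tag prefixes for nouns, verbs, and adjectives.
KEPT_TAG_PREFIXES = ("NN", "VB", "JJ")


def select_concept_candidates(tagged_captions, vocab_size=2000):
    """Return the vocab_size most frequent nouns/verbs/adjectives."""
    counts = Counter(
        word.lower()  # lowercasing is an assumption of this sketch
        for caption in tagged_captions
        for word, tag in caption
        if tag.startswith(KEPT_TAG_PREFIXES)
    )
    return [word for word, _ in counts.most_common(vocab_size)]


captions = [
    [("A", "DT"), ("man", "NN"), ("walks", "VBZ"), ("slowly", "RB")],
    [("The", "DT"), ("man", "NN"), ("opens", "VBZ"), ("a", "DT"),
     ("red", "JJ"), ("door", "NN")],
]
candidates = select_concept_candidates(captions, vocab_size=3)
```

In this toy corpus "man" is the most frequent kept word, while determiners and adverbs never enter the candidate set.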

where $\odot$ is elementwise product, and $\mbox{Conv}(\cdot)$ denotes two convolution operations before the softmax layer in Fig.2. Note that $\boldsymbol{\alpha}^{(l)}_{n}$ in Eq.(3) is computed from the previous hidden state $\mathbf{h}^{(l)}_{n-1}$ of the LSTM.

The spatial attention $\boldsymbol{\alpha}^{(l)}_{n}$ measures how each spatial grid location of visual features is related to the concept being tracked through tracing LSTMs. By repeating these two steps of Eq.(1)–(3) from $n=1$ to $N$ , our model can continuously find important and temporally consistent meanings over time, that are closely related to a part of video, rather than focusing on each video frame individually.
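To make the idea of spatial soft attention concrete, here is a generic single-step sketch. It is not a reproduction of Eq.(1)–(3): the dot-product scoring stands in for the $\mbox{Conv}(\cdot)$ operations, the LSTM update that would consume the attended feature is omitted, and the names `softmax` and `spatial_attention_step` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_step(grid_features, prev_hidden):
    """One soft-attention step over the spatial grid of a single frame.

    grid_features: list of G feature vectors (one per grid location), each of size D.
    prev_hidden:   previous hidden state of one tracing LSTM, size D.
    Returns the attention weights and the attended (weighted-sum) feature.
    """
    # Assumed scoring: dot product between each location and the previous state.
    scores = [sum(f * h for f, h in zip(feat, prev_hidden)) for feat in grid_features]
    alpha = softmax(scores)
    dim = len(grid_features[0])
    attended = [
        sum(a * feat[d] for a, feat in zip(alpha, grid_features)) for d in range(dim)
    ]
    return alpha, attended


grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, attended = spatial_attention_step(grid, prev_hidden=[2.0, 0.0])
```

Because the weights depend on the previous hidden state, repeating this step over frames $n=1,\dots,N$ lets each tracing LSTM keep following the regions that agree with what it has seen so far.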

Finally, we predict the concept confidence vector $\mathbf{p}$ :

Strictly speaking, since we apply an end-to-end learning approach, the cost of Eq.(6) is used as an auxiliary term for the overall cost function, which will be discussed in section 3.

For inference, we compute $\mathbf{p}$ for a given query video, and find top $K$ words from the score $\mathbf{p}$ (i.e. $\operatornamewithlimits{argmax}_{1:K}\mathbf{p}$ ). Finally, we represent these $K$ concept words by their word embedding $\{\mathbf{a}_{i}\}_{i=1}^{K}$ .
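The top-$K$ selection at inference time amounts to sorting the confidence vector. A minimal sketch, where the function name `top_k_concepts` and the toy vocabulary are illustrative assumptions:

```python
def top_k_concepts(confidences, vocabulary, k=10):
    """Return the k candidate words with the highest confidence, best first."""
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return [vocabulary[i] for i in ranked[:k]]


vocabulary = ["car", "walk", "door", "water"]
p = [0.10, 0.70, 0.05, 0.40]
concepts = top_k_concepts(p, vocabulary, k=2)
```

The selected words would then be mapped to their embeddings $\{\mathbf{a}_{i}\}_{i=1}^{K}$ before being passed to the language model.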

Video-to-Language Models

We design a different base model for each of the LSMDC tasks, while they all share the concept word detector and the semantic attention mechanism. That is, we aim to validate that the proposed concept word detection is useful for a wide range of video-to-language models. For the base models, we take advantage of state-of-the-art techniques, which we do not claim as our contribution. We refer to our video-to-language models leveraging the concept word detector as CT-SAN (Concept-Tracing Semantic Attention Network).

For better understanding of our models, we outline the four LSMDC tasks as follows: (i) Movie description: generating a single descriptive sentence for a given movie clip, (ii) Fill-in-the-blank: given a video and a sentence with a single blank, finding a suitable word for the blank from the whole vocabulary set, (iii) Multiple-choice test: given a video query and five descriptive sentences, choosing the correct one out of them, and (iv) Movie retrieval: ranking 1,000 movie clips for a given natural language query.

We defer more model details to the supplementary file. In particular, we skip the description of the multiple-choice and movie retrieval models in Figure 3(b)–(c), which can be found in the supplementary file.

Fig.2 shows the proposed video captioning model. It takes video features $\{\mathbf{v}_{n}\}_{n=1}^{N}$ and the detected concept words $\{\mathbf{a}_{i}\}_{i=1}^{K}$ as input, and produces a word sequence as output $\{\mathbf{y}_{t}\}_{t=1}^{T}$ . The model comprises video encoding and caption decoding LSTMs, and two semantic attention models. The two LSTM networks have two layers in depth, with layer normalization and dropout with a rate of 0.2.

Caption Decoder. The caption decoding LSTM is a normal LSTM network as follows:

Semantic Attention. Building on prior work on semantic attention, our model in Fig.2 uses semantic attention in two different parts, called the input and output semantic attention, respectively.

The input semantic attention $\phi$ computes an attention weight $\gamma_{t,i}$ , which is assigned to each predicted concept word $\mathbf{a}_{i}$ . It helps the caption decoding LSTM focus on different concept words dynamically at each step $t$ .
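The role of the weights $\gamma_{t,i}$ can be illustrated with a generic attention over the $K$ concept embeddings. The sketch below is an assumption-laden illustration, not the model's actual equations: the dot-product score between a query vector (for example, the embedding of the current input word) and each $\mathbf{a}_{i}$, the additive fusion at the end, and the name `input_semantic_attention` are all hypothetical choices.

```python
import math


def input_semantic_attention(query, concept_embeddings):
    """Attend over K concept embeddings given a query vector at step t.

    Returns (gamma, fused): gamma holds one weight per concept word, and
    fused is the query plus the attention-weighted sum of concept embeddings.
    """
    # Assumed scoring function: plain dot product.
    scores = [sum(q * a for q, a in zip(query, emb)) for emb in concept_embeddings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    gamma = [e / z for e in exps]
    dim = len(query)
    attended = [
        sum(g * emb[d] for g, emb in zip(gamma, concept_embeddings)) for d in range(dim)
    ]
    # Assumed fusion: element-wise addition of query and attended concepts.
    fused = [q + c for q, c in zip(query, attended)]
    return gamma, fused


concepts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
gamma, fused = input_semantic_attention([1.0, 0.0], concepts)
```

Since the query changes at every step $t$, the weights change too, which is how the decoder can focus on different concept words over the course of a sentence.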

Finally, the probability of output word is obtained as

Training. To learn the parameters of the model, we define a loss function as the total negative log-likelihood of all the words, with regularization terms on attention weights $\{\mathbf{\alpha}_{t,i}\}$ , $\{\mathbf{\beta}_{t,i}\}$ , and $\{\mathbf{\gamma}_{t,i}\}$ , as well as the loss $\mathcal{L}_{con}$ for concept discovery (Eq.6):

where $\lambda_{1},\lambda_{2}$ are hyperparameters and $g$ is a regularization function, with $p=2$ and $q=0.5$, defined as

For the rest of the models, we transfer the parameters of the concept word detector trained with the description model, and allow them to be fine-tuned.

3.2 A Model for Fill-in-the-Blank

Fig.3(a) illustrates the proposed model for the fill-in-the-blank task. It is based on a bidirectional LSTM network (BLSTM), which is useful in predicting a blank word from an imperfect sentence, since it considers the sequence in both forward and backward directions. Our key idea is to employ the semantic attention mechanism on both input and output of the BLSTM, to strengthen the meaning of input and output words with the detected concept words.

BLSTM. The input video is represented by the video encoding LSTM in Figure 2. The hidden state of the final video frame $\mathbf{s}_{N}$ is used to initialize the hidden states of the BLSTM: $\mathbf{h}^{b}_{T+1}=\mathbf{h}^{f}_{0}=\mathbf{s}_{N}$ , where $\{\mathbf{h}^{f}_{t}\}_{t=1}^{T}$ and $\{\mathbf{h}^{b}_{t}\}_{t=1}^{T}$ are the forward and backward hidden states of the BLSTM, respectively:

The output semantic attention is also similar to that of the captioning model in section 3.1, except that we apply the attention only once, at the $t$-th step where the blank token is taken as input. We feed the output of the BLSTM

Finally, the output word probability $\mathbf{y}$ given $\{\mathbf{c}_{t}\}_{t=1}^{T}$ is obtained via softmax on $\mathbf{p}$ as

Training. During training, we minimize the loss $\mathcal{L}$ as

where $\lambda_{1},\lambda_{2}$ are hyperparameters, and $g$ is the same regularization function of Eq.(15). Again, $\mathcal{L}_{con}$ is the cost of the concept word detector in Eq.(6).

Experiments

We report the experimental results of the proposed models for the four tasks of LSMDC 2016. More experimental results and implementation details can be found in the supplementary file.

LSMDC 2016 comprises four video-to-language tasks on the LSMDC dataset, which contains a parallel corpus of 118,114 sentences and 118,081 video clips sampled from 202 movies. We strictly follow the evaluation protocols of the challenge. We defer more details of the dataset and challenge rules to the challenge homepage (https://sites.google.com/site/describingmovies/).

Movie Description. This task is related to video captioning; given a short video clip, its goal is to generate a single descriptive sentence. The challenge provides a subset of the LSMDC dataset named LSMDC16. It is divided into training, validation, public test, and blind test sets, whose sizes are 91,941, 6,542, 10,053, and 9,578, respectively. The official performance metrics include BLEU-1,2,3,4, METEOR, ROUGE-L, and CIDEr.

Multiple-Choice Test. Given a video query and five candidate captions, the goal is to find the best option among them. The correct answer is the GT caption of the query video, and the four distractors are randomly chosen from the other captions that have different activity-phrase labels from the correct answer. The evaluation metric is the percentage of correctly answered test questions out of the 10,053 public-test examples.

Movie Retrieval. The objective is, given a short query sentence, to search for its corresponding video out of 1,000 candidate videos sampled from the LSMDC16 public-test data. The evaluation metrics include Recall@1/5/10 and Median Rank (MedR). Recall@$k$ is the percentage of queries whose GT video is included in the first $k$ retrieved videos, and MedR is the median rank of the GT video. Each algorithm predicts $1,000\times 1,000$ pairwise rank scores between phrases and videos, from which all the evaluation metrics are calculated.
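As an illustration of how these metrics follow from the pairwise score matrix, the sketch below computes Recall@$k$ and MedR. It is not the official evaluation code: the function name `retrieval_metrics`, the convention that the GT video of query $i$ sits at column $i$, and the optimistic handling of ties are assumptions of this example.

```python
from statistics import median


def retrieval_metrics(scores, ks=(1, 5, 10)):
    """Compute Recall@k (in percent) and Median Rank from a score matrix.

    scores[q][v] is the similarity between query sentence q and video v.
    Assumption of this sketch: the ground-truth video of query q is video q.
    """
    ranks = []
    for q, row in enumerate(scores):
        gt_score = row[q]
        # Rank 1 means no other video scores strictly higher than the GT video.
        ranks.append(1 + sum(1 for s in row if s > gt_score))
    recall = {k: 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return recall, median(ranks)


toy_scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
recall, medr = retrieval_metrics(toy_scores)
```

In the toy matrix, two of the three queries rank their GT video first and one ranks it third, so Recall@1 is about 66.7% and the median rank is 1.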

Movie Fill-in-the-Blank. This task is related to visual question answering; given a video clip and a sentence with a blank in it, its goal is to predict a single correct word to fill in the blank. The test set includes 30,000 examples from 10,000 clips (i.e. about 3 examples per sentence). The evaluation metric is the prediction accuracy, which is the percentage of predicted words that match with GTs.

4.2 Quantitative Results

We compare with the results on the public dataset in the official evaluation server of LSMDC 2016 as of the submission deadline (i.e. November 15th, 2016 UTC 23:59). Except for award winners, the LSMDC participants have no obligation to disclose their identities or the techniques they used. Below we use the IDs in the leaderboard to denote participants.

Movie description. Table 1 compares the performance of movie description between different algorithms. Among comparable models, our approach ranks (5, 4, 1, 1)-th in the BLEU language metrics, and (2, 1, 1)-th in the other language metrics. That is, our approach ranks first in four metrics, which means that our approach is comparable to the state-of-the-art methods. In order to quantify the improvement by the proposed concept word detection and semantic attention, we implement a variant (Base-SAN), which is our model of Fig.2 without those two components. As shown in Table 1, the performance gaps between (CT-SAN) and (Base-SAN) are significant.

Movie Fill-in-the-Blank. Table 1 also shows the results of the fill-in-the-blank task. We test an ensemble of our models, denoted by (CT-SAN) (Ensemble); the answer word is obtained by averaging the output word probabilities of three identical models trained independently. Our approach outperforms all the participants with large margins. We also compare our model with a couple of baselines: (CT-SAN) outperforms the simple single-layer LSTM/BLSTM variants with the scoring layer on top of the blank location, and (Base-SAN), which is the base model of (CT-SAN) without the concept detector and semantic attention.
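The ensembling step described here, averaging the output word probabilities of independently trained models and taking the most probable word, can be sketched as below. The function name `ensemble_predict`, the three-word vocabulary, and the probability values are made up for illustration.

```python
def ensemble_predict(prob_lists, vocabulary):
    """Average per-model word probabilities and return the best word.

    prob_lists: one probability list per model, each aligned with vocabulary.
    """
    n_models = len(prob_lists)
    averaged = [
        sum(probs[i] for probs in prob_lists) / n_models
        for i in range(len(vocabulary))
    ]
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return vocabulary[best], averaged


vocabulary = ["hurry", "run", "walk"]
model_outputs = [
    [0.6, 0.4, 0.0],  # model 1 prefers "hurry"
    [0.2, 0.7, 0.1],  # model 2 prefers "run"
    [0.3, 0.6, 0.1],  # model 3 prefers "run"
]
answer, averaged = ensemble_predict(model_outputs, vocabulary)
```

Averaging probabilities rather than voting on hard predictions lets a confident model outweigh two uncertain ones, which is one plausible reason to prefer it for a large output vocabulary.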

Movie Multiple-Choice Test. For the multiple-choice test, our approach also ranks first as shown in Table 2. As in the fill-in-the-blank, the multiple-choice task also benefits from the concept detector and semantic attention. Moreover, an ensemble of six models trained independently further improves the accuracy from 63.8% to 67.0%.

Movie Retrieval. Table 2 compares Recall@$k$ (R@k) and Median Rank (MedR) metrics between different methods. We also achieve the best retrieval performance with significant margins from baselines. Our (CT-SAN) (Ensemble) obtains the video-sentence similarity matrix with an ensemble of two different kinds of models. First, we train six retrieval models with different parameter initializations. Second, we obtain a similarity matrix using the multiple-choice version of (CT-SAN), because it can also generate a similarity score for a video-sentence pair. Finally, we average the seven similarity matrices into the final similarity matrix.

4.3 Qualitative Results

Fig.4 illustrates qualitative results of our algorithm with correct or wrong examples for each task. In each set, we show sampled frames of a query video, groundtruth (GT), our prediction (Ours), and the detected concept words. We provide more examples in the supplementary file.

Movie Description. Fig.4(a)-(b) illustrates examples of our movie description. The predicted sentences are often closely related to the content of the clips, but the words themselves are not always identical to the GTs. For instance, the generated sentence for Fig.4(b) reads "the clock shows a minute", which is relevant to the video clip although its GT sentence focuses more on awards on a shelf. Nonetheless, concept words relevant to the GT sentence, such as office or clock, are well detected.

Movie Fill-in-the-Blank. Fig.4(c) shows that the detected concept words are well matched with the content of the clip, and possibly help predict the correct answer. Fig.4(d) is a near-miss case where our model predicts a plausible answer (e.g. run instead of hurry).

Movie Multiple-Choice Test. Fig.4(e) shows that our concept detection successfully guides the model to select the correct answer. Fig.4(f) is an example of failure to understand the situation; the fifth candidate is chosen because it overlaps with many of the detected words such as hall, walk, and go, although the correct answer is the second.

Movie Retrieval. Interestingly, the concept words of Fig.4(g) capture the abstract relation between swimming, water, and pool. Thus, the first to fifth retrieved clips all include water. Fig.4(h) is a near-miss example in which our method fails to catch rare words such as twitch and cocks. The first to fourth retrieved clips contain a woman’s head and mouth, yet miss the subtle movement of the mouth.

Conclusion

We proposed an end-to-end trainable approach for detecting a list of concept words that can be used as semantic priors for multiple video-to-language models. We also developed a semantic attention mechanism that effectively exploits the discovered concept words. We implemented our approach in multiple video-to-language models to participate in four tasks of LSMDC 2016. We demonstrated that our method indeed improved the performance of video captioning, retrieval, and question answering, and finally won three tasks in LSMDC 2016, including fill-in-the-blank, multiple-choice test, and movie retrieval.

Acknowledgements. This research is partially supported by Convergence Research Center through National Research Foundation of Korea (2015R1A5A7037676). Gunhee Kim is the corresponding author.

Appendix A Details of Video-to-Language Models

In this section, we describe the further details of video-to-language models (section 3).

Figure 5(b) illustrates the proposed model for the multiple-choice test. It takes a video and five choice sentences among which only one is the correct answer. Hence, our model computes the compatibility scores between the query video and five sentences, and selects the one with the highest score.

The multiple-choice model closely resembles the model for fill-in-the-blank in Figure 5(a). First, it is based on the LSTM network, although it is not bi-directional. Second, it inputs the query video into the video encoding LSTM, and uses its last hidden state $\mathbf{s}_{N}$ to initialize the following LSTM. Third, it uses the same word representation $\{\mathbf{c}_{t}\}_{t=1}^{T}$ for each candidate sentence. Finally, it exploits the same input semantic attention of Eq.(9)–(10), although it does not apply the output semantic attention because the output is not a word but a score in this task.

We obtain a joint embedding of a pair of a single video and a sentence using the LSTM network:

Alignment Objective. The objective of the multiple-choice model is to assign high scores for the correctly matched video-sentence pairs but low scores for incorrect pairs. Therefore, we predict a similarity score $S_{kl}$ between a movie clip $k$ and a sentence $l$ as follows:

where $l^{*}$ denotes the answer sentence among the five candidates. This objective encourages a positive video-sentence pair to have a higher score than a misaligned negative pair by a margin $\Delta$ . We use $\Delta=1$ in our experiments.

At test, for a query video $k$ , we compute five scores $\{S_{k,l}\}_{l=1}^{5}$ of the candidate sentences, and select the one with maximum score $S_{k,l}$ as the answer.
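The margin-based objective and the test-time selection can be sketched as follows. The hinge form used here is the standard max-margin ranking loss and is offered as an assumption consistent with the description above, not as a transcription of the paper's equation; the function names and the toy scores are hypothetical.

```python
def multiple_choice_hinge_loss(scores, answer_idx, margin=1.0):
    """Max-margin loss for one video with five candidate sentence scores.

    Each wrong candidate is penalized when its score comes within `margin`
    of the score of the correct sentence.
    """
    positive = scores[answer_idx]
    return sum(
        max(0.0, margin + s - positive)
        for i, s in enumerate(scores)
        if i != answer_idx
    )


def choose_answer(scores):
    """At test time, pick the candidate sentence with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


candidate_scores = [0.2, 2.0, 0.5, 1.5, -1.0]
loss = multiple_choice_hinge_loss(candidate_scores, answer_idx=1, margin=1.0)
prediction = choose_answer(candidate_scores)
```

In this toy case only the fourth candidate (score 1.5) violates the margin of $\Delta=1$ against the correct sentence (score 2.0), so it alone contributes to the loss.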

A.2 A Model for Retrieval

Figure 5(c) illustrates our model for movie retrieval. The basic idea is to compute a score for a query text and video pair, by learning a joint representation between two modalities (i.e. query text and video) using the CBP (Compact Bilinear Pooling) layer.

For the video encoding, we use the final hidden state $\mathbf{s}_{N}$ of the video encoding LSTM as done in other models. We also obtain a query representation via input semantic attention as in section A.1, through the LSTM network:

We measure a similarity score $S_{k,l}$ between a movie $k$ and a sentence $l$ as follows (see Figure 5(c)):

We use the same max-margin structured loss objective as the multiple-choice model:

which encourages a positive video-sentence pair to have a higher score than a misaligned pair by a margin $\Delta$ (e.g. $\Delta=3$ in our experiments).

At test, for a query sentence $l$, we compute scores $\{S_{k,l}\}_{k}$ for all videos $k$ in the test set. From the score matrix, we can rank the videos for the query. As mentioned in section 4.2, an ensemble of multiple score matrices is used in our final model, which yields much better retrieval performance.

Appendix B Experimental Details

Optimization. We train all of our models using the Adam optimizer to minimize the loss, with an initial learning rate in the range of $10^{-4}$ to $10^{-5}$ . We adopt the data augmentation of image mirroring. We also use batch shuffling in every training epoch. We use Xavier initialization for initializing the weight variables. For all models, the LSTM (BLSTM) networks are two-layered in depth, and we apply layer normalization and dropout with a rate of 0.2 to reduce overfitting.

During training of fill-in-the-blank, multiple-choice, and retrieval models, we initialize the parameters in the concept word detector component with a pre-trained model of the movie description task. The new parameters (e.g. $\mathbf{W}_{s},\mathbf{W}_{a}$ and the LSTM parameters for multi-choice test) are initialized randomly, and then the whole model is trained end-to-end using the provided training set.

Movie Description. The split of LSMDC16 dataset is provided by the challenge organizers: (training, validation, test, blind test set) $=(101079,9578,10053,7409)$ video-sentence pairs respectively. We train our model using the training set of this split, and the Para-Phrase AD sentences additionally provided by the challenge organizers.

Fill-in-the-blank. The LSMDC16 dataset for the fill-in-the-blank task is split into (training, validation, test set) $=(296961,98483,30350)$. We also train our model using the officially provided training set only. To improve prediction accuracy, we use an ensemble of models; the answer word is obtained by averaging the output word probabilities of three copies of the model trained with different initializations.

Multiple-choice test. The training/validation/test split of the LSMDC16 dataset is the same as in the movie description task. Although it is possible to include more negative sentences other than the provided four distractors (we also find that it leads to a better accuracy), we experiment with models trained using the four distractors only. We simply average the score matrices $S_{k,l}$ of individual models to obtain the ensembled score matrix. In our experiments, an ensemble of six copies of the model trained independently, denoted by CT-SAN (Ensemble), shows a considerable improvement in accuracy.

Appendix C More Experimental Results

In this section, we provide additional experimental results to support the validity of the proposed concept word detector and semantic attention models.

To study the effect of the quality of concept words, we experiment with additional baselines: (rand-SAN), (no-ATT-SAN), and (NN-SAN).

Random Concept Words. A baseline (rand-SAN) is a variant of the same structure as (CT-SAN), except that it uses random concept words instead of the ones detected by the concept word detector. We uniformly sample $K=10$ words for concept words, from the $V$ candidates.
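The random baseline reduces to a uniform draw from the candidate vocabulary. A minimal sketch, assuming sampling without replacement (the text does not say whether repeats are allowed) and using a hypothetical function name and placeholder vocabulary:

```python
import random


def sample_random_concepts(candidates, k=10, seed=None):
    """Uniformly sample k distinct concept words from the V candidates."""
    rng = random.Random(seed)
    # random.Random.sample draws without replacement (an assumption here).
    return rng.sample(candidates, k)


# Placeholder vocabulary of V = 2,000 candidate words.
candidates = ["word%d" % i for i in range(2000)]
random_concepts = sample_random_concepts(candidates, k=10, seed=0)
```

Because these words are unrelated to the video, the baseline isolates how much the semantic attention depends on the quality of its input words.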

which replaces Eq.(2) and Eq.(5), respectively. This baseline model simply transforms the video representation into concept words, but does not involve any spatial attention.

Nearest Neighbor. We also study a simpler baseline which uses a nearest-neighbor method instead of the concept word detector. This simple baseline is denoted by (NN-SAN). In this method, we simply take the concept words of the closest training video, in terms of ResNet video features averaged over time.
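A sketch of this nearest-neighbor lookup is given below. The squared Euclidean distance is an assumption, since the text does not name the distance measure, and the function names and two-dimensional toy features are hypothetical stand-ins for real ResNet features.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors over time."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def nearest_neighbor_concepts(query_frames, train_videos):
    """Return the concept words of the training video closest to the query.

    train_videos: list of (frame_features, concept_words) pairs.
    Distance: squared Euclidean between time-averaged features (assumed).
    """
    query_vec = mean_pool(query_frames)

    def sq_dist(frames):
        vec = mean_pool(frames)
        return sum((a - b) ** 2 for a, b in zip(query_vec, vec))

    _, concepts = min(train_videos, key=lambda item: sq_dist(item[0]))
    return concepts


train_videos = [
    ([[0.0, 1.0]], ["water", "pool"]),
    ([[0.9, 0.1], [1.1, -0.1]], ["car", "drive"]),
]
nn_concepts = nearest_neighbor_concepts([[1.0, 0.0], [1.0, 0.0]], train_videos)
```

Unlike the learned detector, this lookup involves no spatial attention and no training, which makes it a useful lower bound on what retrieved words alone can contribute.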

Quantitative Result. As shown in Tables 3 and 4, the performance of (no-ATT-SAN) is better than (Base-SAN) and (NN-SAN), but poorer than the full model (CT-SAN), in all four tasks. This implies that the spatial attention helps detect concept words that are useful for video captioning. In particular, (CT-SAN) outperforms (no-ATT-SAN) in the fill-in-the-blank and multiple-choice tasks by a large margin. Nevertheless, using semantic attention turns out to be more helpful than not using it, as one can observe that (no-ATT-SAN) shows a better performance than (Base-SAN).

The performance of (rand-SAN), which has semantic attention but poor concept words, is much inferior to (Base-SAN), which even lacks semantic attention. As such, we find that the quality of concept words is crucial for performance enhancement. Besides, the words retrieved by (NN-SAN) are not very helpful in training the semantic attention network. (Decoupled CT-SAN) also shows worse performance than (CT-SAN), which is trained in an end-to-end manner. These results suggest that jointly learning the concept word detector and the task-specific network is effective in achieving better performance.

C.2 Ablation Study

We conduct an additional ablation experiment on the semantic attention, and present the results of movie description and FITB (Fill-in-the-Blank) tasks in Table 6.

C.3 On the Number of Concept Words

We also conduct another simple experiment on the number of concept words. We compute the performance of (CT-SAN) while varying the number of detected concept words, $K\in\{5,10,20\}$. As shown in Table 5, we observe only a marginal performance difference. However, as the number of concept words increases, the time required to train the whole model increases, and overfitting becomes more likely.

Appendix D More Examples and Qualitative Results

We visualize some examples of the spatial attention computed in the concept word detector in Figures 6–7. The spatial attention roughly captures high-level concepts in the video (e.g. a blue car moving left to right, in Fig.6(a)). Figures 8–9 show some examples of generated movie descriptions with the concept words detected by several baselines and our approach.

In the following, we present more examples of movie description results in Figure 10. Additional examples of the fill-in-the-blank task follow in Figure 11, and more examples of the multiple-choice test are given in Figure 12. Finally, we present examples of the movie retrieval task in Figures 13–14. We also show each model’s output and the corresponding detected concept words.