Multi-Task Video Captioning with Video and Entailment Generation

Ramakanth Pasunuru, Mohit Bansal

Introduction

Video captioning is the task of automatically generating a natural language description of the content of a video, as shown in Fig. 1. It has various applications, such as assisting visually impaired people and improving the quality of online video search and retrieval. This task has gained recent momentum in the natural language processing and computer vision communities, especially with the advent of powerful image processing features as well as sequence-to-sequence LSTM models. It is also a step forward from static image captioning because, in addition to modeling the spatial visual features, the model also needs to learn the temporal across-frame action dynamics and the logical storyline language dynamics.

Previous work in video captioning Venugopalan et al. (2015a); Pan et al. (2016b) has shown that recurrent neural networks (RNNs) are a good choice for modeling the temporal information in the video. A sequence-to-sequence model is then used to ‘translate’ the video to a caption. Venugopalan et al. (2016) showed linguistic improvements over this by fusing the decoder with external language models. Furthermore, an attention mechanism between the video frames and the caption words captures some of the temporal matching relations better Yao et al. (2015); Pan et al. (2016a). More recently, hierarchical two-level RNNs were proposed to allow for longer inputs and to model the full paragraph caption dynamics of long video clips Pan et al. (2016a); Yu et al. (2016).

Despite these recent improvements, video captioning models still suffer from the lack of sufficient temporal and logical supervision to be able to correctly capture the action sequence and story-dynamic language in videos, especially in the case of short clips. Hence, they would benefit from incorporating such complementary directed knowledge, both visual and textual. We address this by jointly training the task of video captioning with two related directed-generation tasks: a temporally-directed unsupervised video prediction task and a logically-directed language entailment generation task. We model this via many-to-many multi-task learning based sequence-to-sequence models Luong et al. (2016) that allow the sharing of parameters among the encoders and decoders across the three different tasks, with additional shareable attention mechanisms.

The unsupervised video prediction task, i.e., video-to-video generation (adapted from Srivastava et al. (2015)), shares its encoder with the video captioning task’s encoder, and helps it learn richer video representations that can predict their temporal context and action sequence. The entailment generation task, i.e., premise-to-entailment generation (based on the image caption domain SNLI corpus Bowman et al. (2015)), shares its decoder with the video captioning decoder, and helps it learn better video-entailed caption representations, since the caption is essentially an entailment of the video, i.e., it describes subsets of objects and events that are logically implied by (or follow from) the full video content. The overall many-to-many multi-task model combines all three tasks.

Our three novel multi-task models show statistically significant improvements over the state-of-the-art, and achieve the best-reported results (and rank) on multiple datasets, based on several automatic and human evaluations. We also demonstrate that video captioning, in turn, gives mutual improvements on the new multi-reference entailment generation task.

Related Work

Early video captioning work Guadarrama et al. (2013); Thomason et al. (2014); Huang et al. (2013) used a two-stage pipeline to first extract a subject, verb, and object (S,V,O) triple and then generate a sentence based on it. Venugopalan et al. (2015b) fed mean-pooled static frame-level visual features (from convolutional neural networks pre-trained on image recognition) of the video as input to the language decoder. To harness the important temporal ordering of the frame sequence, Venugopalan et al. (2015a) proposed a sequence-to-sequence model with video encoder and language decoder RNNs.

More recently, Venugopalan et al. (2016) explored linguistic improvements to the caption decoder by fusing it with external language models. Moreover, an attention or alignment mechanism was added between the encoder and the decoder to learn the temporal relations (matching) between the video frames and the caption words Yao et al. (2015); Pan et al. (2016a). In contrast to static visual features, Yao et al. (2015) also considered temporal video features from a 3D-CNN model pre-trained on an action recognition task.

To explore long-range temporal relations, Pan et al. (2016a) proposed a two-level hierarchical RNN encoder which limits the length of input information and allows temporal transitions between segments. In the hierarchical RNN of Yu et al. (2016), the first level generates sentences and the second level captures inter-sentence dependencies in a paragraph. Pan et al. (2016b) proposed to simultaneously learn the RNN word probabilities and a visual-semantic joint embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Despite these useful recent improvements, video captioning still suffers from limited supervision and generalization capabilities, especially given the complex action-based temporal and story-based logical dynamics that need to be captured from short video clips. Our work addresses this issue by bringing in complementary temporal and logical knowledge from video prediction and textual entailment generation tasks (respectively), and training them together via many-to-many multi-task learning.

Multi-task learning is a useful learning paradigm to improve the supervision and the generalization performance of a task by jointly training it with related tasks Caruana (1998); Argyriou et al. (2007); Kumar and Daumé III (2012). Recently, Luong et al. (2016) combined multi-task learning with sequence-to-sequence models, sharing parameters across the tasks’ encoders and decoders. They showed improvements on machine translation using parsing and image captioning. We additionally incorporate an attention mechanism into this many-to-many multi-task learning approach and improve the multimodal, temporal-logical video captioning task by sharing its video encoder with the encoder of a video-to-video prediction task and by sharing its caption decoder with the decoder of a linguistic premise-to-entailment generation task.

Image representation learning has been successful via supervision from very large object-labeled datasets. However, similar amounts of supervision are lacking for video representation learning. Srivastava et al. (2015) address this by proposing unsupervised video representation learning via sequence-to-sequence RNN models, where they reconstruct the input video sequence or predict the future sequence. We model video generation with an attention-enhanced encoder-decoder and harness it to improve video captioning.

The task of recognizing textual entailment (RTE) is to classify whether the relationship between a premise and hypothesis sentence is that of entailment (i.e., logically follows), contradiction, or independence (neutral), which is helpful for several downstream NLP tasks. The recent Stanford Natural Language Inference (SNLI) corpus by Bowman et al. (2015) allowed training end-to-end neural networks that outperform earlier feature-based RTE models Lai and Hockenmaier (2014); Jimenez et al. (2014). However, directly generating the entailed hypothesis sentences given a premise sentence would be even more beneficial than retrieving or reranking sentence pairs, because most downstream generation tasks only come with the source sentence and not pairs. Recently, Kolesnyk et al. (2016) tried a sequence-to-sequence model for this on the original SNLI dataset, which is a single-reference setting and hence restricts automatic evaluation. We modify the SNLI corpus to a new multi-reference (and a more challenging zero train-test premise overlap) setting, and present a novel multi-task training setup with the related video captioning task (where the caption is also entailed by the video), showing mutual improvements on both the tasks.

Models

We first discuss a simple encoder-decoder model as a baseline reference for video captioning. Next, we improve this via an attention mechanism. Finally, we present similar models for the unsupervised video prediction and entailment generation tasks, and then combine them with video captioning via the many-to-many multi-task approach.

Our baseline model is similar to the standard machine translation encoder-decoder RNN model Sutskever et al. (2014), where the final state of the encoder RNN is input as the initial state of the decoder RNN, as shown in Fig. 2. The RNN is based on Long Short-Term Memory (LSTM) units, which are good at memorizing long sequences due to forget-style gates Hochreiter and Schmidhuber (1997). For video captioning, the input to the encoder is the sequence of video frame features $\{f_{1},f_{2},...,f_{n}\}$ of length $n$ (we use several popular image features such as VGGNet, GoogLeNet, and Inception-v4; details in Sec. 4.1), and the caption word sequence $\{w_{1},w_{2},...,w_{m}\}$ of length $m$ is generated during the decoding phase. The distribution of the output sequence w.r.t. the input sequence is:

$$p(w_{1},...,w_{m}|f_{1},...,f_{n})=\prod_{t=1}^{m}p(w_{t}|h_{t}^{d}) \tag{1}$$

where $h^{d}_{t}$ is the hidden state at the $t^{th}$ time step of the decoder RNN, obtained from $h_{t-1}^{d}$ and $w_{t-1}$ via the standard LSTM-RNN equations. The distribution $p(w_{t}|h_{t}^{d})$ is given by a softmax over all the words in the vocabulary.
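As a small illustration of how a caption is scored under such a decoder, the sketch below assumes that the sequence probability factorizes into a product of per-step softmax probabilities, so that the log-probability is a sum over decoding steps. The function names (`softmax`, `sequence_log_prob`) and the toy logits are hypothetical and stand in for the real decoder outputs; this is not the implementation used for the reported experiments.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    top = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, word_ids):
    """Sum of log p(w_t | h_t^d) over decoding steps.

    step_logits: one list of vocabulary scores per decoder step
                 (assumed to come from the decoder hidden state h_t^d)
    word_ids:    index of the reference word at each step
    """
    return sum(
        math.log(softmax(logits)[word])
        for logits, word in zip(step_logits, word_ids)
    )

# Toy example with a 2-word vocabulary and uniform scores at both steps.
toy_log_prob = sequence_log_prob([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

Working in log space avoids numerical underflow for long captions, which is why the product is usually evaluated as a sum of logs.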

3.2 Attention-based Model

Our attention model architecture is similar to Bahdanau et al. (2015), with a bidirectional LSTM-RNN as the encoder and a unidirectional LSTM-RNN as the decoder (see Fig. 3). At each time step $t$, the decoder LSTM hidden state $h^{d}_{t}$ is a non-linear recurrent function $S$ of the previous decoder hidden state $h^{d}_{t-1}$, the previous time step’s generated word $w_{t-1}$, and the context vector $c_{t}$:

$$h^{d}_{t}=S(h^{d}_{t-1},w_{t-1},c_{t}) \tag{2}$$

where $c_{t}$ is a weighted sum of the encoder hidden states $\{h^{e}_{i}\}$:

$$c_{t}=\sum_{i=1}^{n}\alpha_{t,i}h^{e}_{i} \tag{3}$$

These attention weights $\{\alpha_{t,i}\}$ act as an alignment mechanism by giving higher weights to the encoder hidden states that better match the current decoder time step, and are computed as:

$$\alpha_{t,i}=\frac{\exp(e_{t,i})}{\sum_{k=1}^{n}\exp(e_{t,k})} \tag{4}$$

where the attention function $e_{t,i}$ is defined as:

$$e_{t,i}=w^{T}\tanh(W^{e}_{a}h^{e}_{i}+W^{d}_{a}h^{d}_{t-1}+b_{a}) \tag{5}$$

where $w$, $W^{e}_{a}$, $W^{d}_{a}$, and $b_{a}$ are learned parameters. This attention-based sequence-to-sequence model (Fig. 3) is our enhanced baseline for video captioning. We next discuss similar models for the new tasks of unsupervised video prediction and entailment generation, and then finally share them via multi-task learning.
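The attention step can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used for the reported experiments: it assumes the additive scoring form of Bahdanau et al. (2015) applied to the previous decoder state, and the function names (`matvec`, `additive_attention`), the nested-list representation, and the toy parameter values are all hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def additive_attention(enc_states, dec_prev, w, W_e, W_d, b):
    """Return (attention weights, context vector) for one decoder step.

    enc_states: encoder hidden states, one list per input position
    dec_prev:   previous decoder hidden state (an assumption of this sketch)
    w, W_e, W_d, b: learned attention parameters (toy values in the demo)
    """
    dec_proj = matvec(W_d, dec_prev)
    scores = []
    for state in enc_states:
        enc_proj = matvec(W_e, state)
        hidden = [math.tanh(e + d + bias)
                  for e, d, bias in zip(enc_proj, dec_proj, b)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Softmax over encoder positions (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(enc_states[0])
    context = [sum(a * state[j] for a, state in zip(alpha, enc_states))
               for j in range(dim)]
    return alpha, context

# Toy example: three encoder states of size 2 and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, context = additive_attention(
    enc_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    dec_prev=[0.5, -0.5],
    w=[1.0, 1.0], W_e=identity, W_d=identity, b=[0.0, 0.0],
)
```

The weights always sum to one, so the context vector stays in the convex hull of the encoder states regardless of the sequence length.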

3.3 Unsupervised Video Prediction

We model unsupervised video representation by predicting the sequence of future video frames given the current frame sequence. Similar to Sec. 3.2, a bidirectional LSTM-RNN encoder and an LSTM-RNN decoder are used, along with attention. If the frame-level features of a video of length $n$ are $\{f_{1},f_{2},...,f_{n}\}$, these are divided into two sets such that, given the current frames $\{f_{1},f_{2},...,f_{k}\}$ (in its encoder), the model has to predict (decode) the rest of the frames $\{f_{k+1},f_{k+2},...,f_{n}\}$. The motivation is that this helps the video encoder learn rich temporal representations that are aware of their action-based context and are also robust to missing frames and varying frame lengths or motion speeds. The optimization function is defined as:

$$\operatorname*{minimize}_{\phi}\ \sum_{t=1}^{n-k}\|f_{t}^{d}-f_{t+k}\|_{2}^{2} \tag{6}$$

where $\phi$ are the model parameters, $f_{t+k}$ is the true future frame feature at decoder time step $t$, and $f_{t}^{d}$ is the decoder’s predicted future frame feature at decoder time step $t$, defined as:

$$f_{t}^{d}=S(h_{t-1}^{d},f_{t-1}^{d},c_{t}) \tag{7}$$

where $S$ is the decoder’s non-linear recurrent function, similar to Eqn. 2, with $h_{t-1}^{d}$ and $f_{t-1}^{d}$ as the previous time step’s hidden state and predicted frame feature, respectively, and $c_{t}$ as the attention-weighted context vector.
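The data side of this task can be sketched as below. The sketch assumes a squared-error objective between predicted and true future frame features, in the spirit of Srivastava et al. (2015), and an 80/20 input/prediction split; the function names (`split_frames`, `squared_error`) and the toy two-dimensional features are hypothetical, and the decoder that would produce the predictions is omitted.

```python
def split_frames(frames, input_fraction=0.8):
    """Split frame features into encoder inputs and decoder targets.

    Keeps at least one frame on each side of the split.
    """
    k = round(len(frames) * input_fraction)
    k = max(1, min(len(frames) - 1, k))
    return frames[:k], frames[k:]

def squared_error(predicted, target):
    """Sum of squared L2 distances between predicted and true features."""
    return sum(
        (p - t) ** 2
        for pred_frame, true_frame in zip(predicted, target)
        for p, t in zip(pred_frame, true_frame)
    )

# Toy clip: ten frames with 2-dimensional features.
frames = [[float(i), float(i)] for i in range(10)]
past, future = split_frames(frames)
# A perfect predictor would reproduce `future` exactly and incur zero loss.
perfect_loss = squared_error(future, future)
```

Because the targets are the video's own later frames, no caption or label is needed, which is what makes the task unsupervised.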

3.4 Entailment Generation

Given a sentence (premise), the task of entailment generation is to generate a sentence (hypothesis) which is a logical deduction or implication of the premise. Our entailment generation model again uses a bidirectional LSTM-RNN encoder and an LSTM-RNN decoder with an attention mechanism (similar to Sec. 3.2). If the premise $s^{p}$ is a sequence of words $\{w_{1}^{p},w_{2}^{p},...,w_{n}^{p}\}$ and the hypothesis $s^{h}$ is $\{w_{1}^{h},w_{2}^{h},...,w_{m}^{h}\}$, the distribution of the entailed hypothesis w.r.t. the premise is:

$$p(w_{1}^{h},...,w_{m}^{h}|w_{1}^{p},...,w_{n}^{p})=\prod_{t=1}^{m}p(w_{t}^{h}|h_{t}^{d}) \tag{8}$$

where the distribution $p(w_{t}^{h}|h_{t}^{d})$ is again obtained via a softmax over all the words in the vocabulary, and the decoder state $h_{t}^{d}$ is computed similarly to Eqn. 2.

3.5 Multi-Task Learning

Multi-task learning helps in sharing information between different tasks and across domains. Our primary aim is to improve the video captioning model, where visual content translates to a textual form in a directed (entailed) generation way. Hence, this presents an interesting opportunity to share temporally and logically directed knowledge with both visual and linguistic generation tasks. Fig. 4 shows our overall many-to-many multi-task model for jointly learning video captioning, unsupervised video prediction, and textual entailment generation. Here, the video captioning task shares its video encoder (parameters) with the encoder of the video prediction task (one-to-many setting) so as to learn context-aware and temporally-directed visual representations (see Sec. 3.3).

Moreover, the decoder of the video captioning task is shared with the decoder of the textual entailment generation task (many-to-one setting), thus helping generate captions that can be ‘entailed’ by, i.e., are logically implied by or follow from, the video content (see Sec. 3.4). (Empirically, logical entailment helped captioning more than simple fusion with language modeling, i.e., partial sentence completion with no logical implication, because a caption is also ‘entailed’ by a video in a logically-directed sense, and hence the entailment generation task matches the video captioning task better than language modeling. Moreover, a multi-task setup is more suitable for adding directed information such as entailment, as opposed to pretraining or fusion with only the decoder. Details in Sec. 5.1.) In both the one-to-many and the many-to-one settings, we also allow the attention parameters to be shared or separated. The overall many-to-many setting thus improves both the visual and language representations of the video captioning model.

We train the multi-task model by alternately optimizing each task in mini-batches based on a mixing ratio. Let $\alpha_{v}$, $\alpha_{f}$, and $\alpha_{e}$ be the number of mini-batches optimized alternately from each of the three tasks: video captioning, unsupervised video future frame prediction, and entailment generation, respectively. Then the mixing ratio is defined as $\frac{\alpha_{v}}{(\alpha_{v}+\alpha_{f}+\alpha_{e})}:\frac{\alpha_{f}}{(\alpha_{v}+\alpha_{f}+\alpha_{e})}:\frac{\alpha_{e}}{(\alpha_{v}+\alpha_{f}+\alpha_{e})}$.
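The alternating schedule can be illustrated with a short sketch. It assumes that each task's mini-batches are consumed in one contiguous block per round before switching to the next task, which is one plausible reading of alternate optimization; the function names (`mixing_ratio`, `task_schedule`) and the task labels are hypothetical. The counts 100:100:50 are used only as example values.

```python
def mixing_ratio(batches_per_task):
    """Normalize per-task mini-batch counts into fractions that sum to 1."""
    total = sum(batches_per_task.values())
    return {task: count / total for task, count in batches_per_task.items()}

def task_schedule(batches_per_task, total_steps):
    """List which task to optimize at each training step.

    Assumption of this sketch: tasks are visited in dictionary order and
    each contributes a contiguous block of mini-batches per round.
    """
    if sum(batches_per_task.values()) <= 0:
        raise ValueError("at least one task needs a positive batch count")
    schedule = []
    while len(schedule) < total_steps:
        for task, count in batches_per_task.items():
            schedule.extend([task] * count)
    return schedule[:total_steps]

counts = {"captioning": 100, "video_prediction": 100, "entailment": 50}
ratio = mixing_ratio(counts)
schedule = task_schedule(counts, total_steps=250)
```

In a real training loop, each entry of `schedule` would select which task's data iterator and loss to use for that step, while the shared encoder or decoder parameters receive updates from every task that uses them.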

Experimental Setup

We report results on three popular video captioning datasets. First, we use the YouTube2Text or MSVD Chen and Dolan (2011) for our primary results, which contains 1970 YouTube videos in the wild with several different reference captions per video (40 on average). We also use MSR-VTT Xu et al. (2016) with 10,000 diverse video clips (from a video search engine), which has 200,000 video clip-sentence pairs and around 20 captions per video; and M-VAD Torabi et al. (2015) with 49,000 movie-based video clips but only 1 or 2 captions per video, making most evaluation metrics (except paraphrase-based METEOR) infeasible. We use the standard splits for all three datasets. Further details about all these datasets are provided in the supplementary.

For our unsupervised video representation learning task, we use the UCF-101 action videos dataset Soomro et al. (2012), which contains 13,320 video clips of 101 action categories, and suits our video captioning task well because it also contains short video clips of a single action or few actions. We use the standard splits (further details in the supplementary).

For the entailment generation encoder-decoder model, we use the Stanford Natural Language Inference (SNLI) corpus Bowman et al. (2015), which contains human-annotated English sentence pairs with classification labels of entailment, contradiction, and neutral. It has a total of 570,152 sentence pairs, out of which 190,113 correspond to true entailment pairs, and we use this subset in our multi-task video captioning model. For improving video captioning, we use the same training/validation/test splits as provided by Bowman et al. (2015), which is 183,416 training, 3,329 validation, and 3,368 testing pairs (for the entailment subset).

However, for the entailment generation multi-task results (see results in Sec. 5.3), we modify the splits so as to create a multi-reference setup which can afford evaluation with automatic metrics. A given premise usually has multiple entailed hypotheses, but the original SNLI corpus is set up as single-reference (for classification). Due to this, the different entailed hypotheses of the same premise end up in different splits of the dataset (e.g., one in train and one in test/validation) in many cases. Therefore, we regroup the premise-entailment pairs and modify the split as follows: among the 190,113 premise-entailment pairs subset of the SNLI corpus, there are 155,898 unique premises, out of which 145,822 have only one hypothesis, and we make this the training set; the rest of them (10,076) have more than one hypothesis, which we randomly shuffle and divide equally into test and validation sets, so that each of these two sets has approximately the same distribution of the number of reference hypotheses per premise.

These new validation and test sets hence contain premises with multiple entailed hypotheses as ground truth references, thus allowing for automatic metric evaluation, where differing generations still get positive scores by matching one of the multiple references. Also, this creates a more challenging dataset for entailment generation because of zero premise overlap between the training and val/test sets. We will make these split details publicly available.
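The regrouping procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `make_multi_reference_splits`, the fixed random seed, and the toy premise/hypothesis strings are hypothetical, and details such as how duplicate hypotheses are handled are not specified in the text and are left untouched here.

```python
import random
from collections import defaultdict

def make_multi_reference_splits(pairs, seed=0):
    """Regroup (premise, hypothesis) entailment pairs by premise.

    Premises with exactly one hypothesis form the training set; premises
    with several hypotheses are shuffled and halved into validation/test.
    Returns three dicts mapping premise -> list of reference hypotheses.
    """
    by_premise = defaultdict(list)
    for premise, hypothesis in pairs:
        by_premise[premise].append(hypothesis)

    train = {p: refs for p, refs in by_premise.items() if len(refs) == 1}
    multi = [(p, refs) for p, refs in by_premise.items() if len(refs) > 1]

    random.Random(seed).shuffle(multi)  # seeded for reproducibility
    half = len(multi) // 2
    return train, dict(multi[:half]), dict(multi[half:])

# Toy pairs, made up purely to exercise the function.
toy_pairs = [
    ("a", "a1"), ("b", "b1"), ("b", "b2"), ("c", "c1"), ("c", "c2"),
    ("c", "c3"), ("d", "d1"), ("e", "e1"), ("e", "e2"), ("f", "f1"),
    ("f", "f2"),
]
train, val, test = make_multi_reference_splits(toy_pairs)
```

Grouping by premise before splitting is what guarantees zero premise overlap between the training set and the validation/test sets.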

For the three video captioning datasets and UCF-101, we fix our sampling rate to 3 fps to bring uniformity in the temporal representation of actions across all videos. These sampled frames are then converted into features using several state-of-the-art models pre-trained on ImageNet Deng et al. (2009): VGGNet Simonyan and Zisserman (2015), GoogLeNet Szegedy et al. (2015); Ioffe and Szegedy (2015), and Inception-v4 Szegedy et al. (2016). Details of these feature dimensions and layer positions are in the supplementary.

4.2 Evaluation (Automatic and Human)

For our video captioning as well as entailment generation results, we use four diverse automatic evaluation metrics that are popular for image/video captioning and language generation in general: METEOR Denkowski and Lavie (2014), BLEU-4 Papineni et al. (2002), CIDEr-D Vedantam et al. (2015), and ROUGE-L Lin (2004). In particular, METEOR and CIDEr-D have been justified to be better for generation tasks, because CIDEr-D uses consensus among the (large) number of references and METEOR uses soft matching based on stemming, paraphrasing, and WordNet synonyms. We use the standard evaluation code from the Microsoft COCO server Chen et al. (2015) to obtain these results and also to compare the results with previous papers. (We use the average of these four metrics on the validation set to choose the best model, except for the single-reference M-VAD dataset, where we only report and choose based on METEOR.)
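Model selection by the average of the four metrics can be sketched as below. The function names (`average_score`, `pick_best_checkpoint`), the checkpoint labels, and all score values are hypothetical and made up purely to exercise the code; they are not results from any experiment.

```python
def average_score(metrics):
    """Mean of the automatic metric scores for one checkpoint."""
    return sum(metrics.values()) / len(metrics)

def pick_best_checkpoint(validation_scores):
    """Return the checkpoint name with the highest average validation score."""
    return max(
        validation_scores,
        key=lambda name: average_score(validation_scores[name]),
    )

# Hypothetical validation scores (illustrative values only).
validation_scores = {
    "step_1000": {"METEOR": 30.0, "CIDEr-D": 60.0, "ROUGE-L": 65.0, "BLEU-4": 40.0},
    "step_2000": {"METEOR": 32.0, "CIDEr-D": 70.0, "ROUGE-L": 66.0, "BLEU-4": 44.0},
}
best = pick_best_checkpoint(validation_scores)
```

For a single-reference dataset, the same function works if each checkpoint's dictionary contains only the METEOR entry.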

We also present human evaluation results based on relevance (i.e., how related is the generated caption w.r.t. the video contents such as actions, objects, and events; or is the generated hypothesis entailed or implied by the premise) and coherence (i.e., a score on the logic, readability, and fluency of the generated sentence).

4.3 Training Details

We tune all hyperparameters on the dev splits: LSTM-RNN hidden state size, learning rate, weight initializations, and mini-batch mixing ratios (tuning ranges in the supplementary). We use the following settings in all of our models (unless otherwise specified): we unroll video encoder/decoder RNNs to 50 time steps and language encoder/decoder RNNs to 30 time steps. We use a 1024-dimension RNN hidden state size and 512-dim vectors to embed visual features and word vectors. We use the Adam optimizer Kingma and Ba (2015). We apply a dropout of 0.5. See the subsections below and the supplementary for full details.

Results and Analysis

Table 1 presents our primary results on the YouTube2Text (MSVD) dataset, reporting several previous works, all our baselines and attention model ablations, and our three multi-task models, using the four automated evaluation metrics. For each subsection below, we have reported the important training details inline, and refer to the supplementary for full details (e.g., learning rates and initialization).

We first present all our baseline model choices (ablations) in Table 1. Our baselines represent the standard sequence-to-sequence model with three different visual feature types as well as those with attention mechanisms. Each baseline model is trained with three random seed initializations and the average is reported (for stable results). The final baseline model ($\otimes$) instead uses an ensemble (E), which is a standard denoising method Sutskever et al. (2014) that performs inference over ten randomly initialized models, i.e., at each time step $t$ of the decoder, we generate a word based on the average of the likelihood probabilities from the ten models. Moreover, we use beam search with size 5 for all baseline models. Overall, the final baseline model with Inception-v4 features, attention, and 10-ensemble performs well (and is better than all previous state-of-the-art), and so we next add all our novel multi-task models on top of this final baseline.
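The per-step ensemble averaging can be illustrated with a short sketch. For brevity it picks the single most likely word (greedy decoding) rather than running beam search, which is a simplification; the function names (`ensemble_step`, `greedy_word`) and the toy three-word distributions are hypothetical.

```python
def ensemble_step(distributions):
    """Average the word probabilities of several models for one decoder step.

    distributions: one dict (word -> probability) per ensemble member,
                   all defined over the same vocabulary.
    """
    count = len(distributions)
    vocabulary = distributions[0].keys()
    return {
        word: sum(dist[word] for dist in distributions) / count
        for word in vocabulary
    }

def greedy_word(distributions):
    """Pick the most likely word under the averaged distribution."""
    averaged = ensemble_step(distributions)
    return max(averaged, key=averaged.get)

# Toy per-model distributions for a single decoder step (made-up values).
model_outputs = [
    {"man": 0.6, "dog": 0.3, "cat": 0.1},
    {"man": 0.2, "dog": 0.7, "cat": 0.1},
    {"man": 0.3, "dog": 0.5, "cat": 0.2},
]
chosen = greedy_word(model_outputs)
```

Averaging smooths out the idiosyncratic errors of individual random initializations: in the toy example the first model alone would choose "man", while the ensemble chooses "dog".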

Here, the video captioning and unsupervised video prediction tasks share their encoder LSTM-RNN weights and image embeddings in a one-to-many multi-task setting. Two important hyperparameters tuned (on the validation set of captioning datasets) are the ratio of encoder vs. decoder frames for video prediction on UCF-101 (where we found that 80% of frames as input and 20% for prediction performs best), and the mini-batch mixing ratio between the captioning and video prediction tasks (where we found 100:200 works well). Table 1 shows a statistically significant improvement in all metrics in comparison to the best baseline (non-multitask) model as well as w.r.t. all previous works, demonstrating the effectiveness of multi-task learning for video captioning with video prediction, even with unsupervised signals. (Statistical significance: $p<0.01$ for CIDEr-D and ROUGE-L, $p<0.02$ for BLEU-4, and $p<0.03$ for METEOR, based on the bootstrap test Noreen (1989); Efron and Tibshirani (1994) with 100K samples.)
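For readers unfamiliar with bootstrap significance testing, the sketch below shows one common paired variant. It is not necessarily the exact procedure used for the reported p-values: it assumes per-example scores whose mean is the system score, whereas corpus-level metrics such as BLEU-4 or CIDEr-D would need to be recomputed on each resample. The function name `paired_bootstrap_p_value`, the seed, and the toy scores are hypothetical.

```python
import random

def paired_bootstrap_p_value(scores_a, scores_b, num_samples=1000, seed=0):
    """Estimate how often system B fails to beat system A under resampling.

    scores_a, scores_b: per-example scores of the two systems on the same
                        test examples (same order, same length).
    Returns the fraction of bootstrap resamples in which the mean score
    of B is not strictly higher than the mean score of A.
    """
    rng = random.Random(seed)
    size = len(scores_a)
    not_better = 0
    for _ in range(num_samples):
        # Resample test examples with replacement, keeping pairs aligned.
        indices = [rng.randrange(size) for _ in range(size)]
        mean_a = sum(scores_a[i] for i in indices) / size
        mean_b = sum(scores_b[i] for i in indices) / size
        if mean_b <= mean_a:
            not_better += 1
    return not_better / num_samples

# Toy scores (made up): system B is better on every example.
baseline_scores = [0.1, 0.2, 0.3, 0.4]
multitask_scores = [0.2, 0.3, 0.4, 0.5]
p_value = paired_bootstrap_p_value(baseline_scores, multitask_scores)
```

With more resamples (for instance 100K, as in the text), the estimate of the p-value becomes more stable, at a proportional cost in computation.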

Here, the video captioning and entailment generation tasks share their language decoder LSTM-RNN weights and word embeddings in a many-to-one multi-task setting. We observe that a mixing ratio of 100:50 alternating mini-batches (between the captioning and entailment tasks) works well here. Again, Table 1 shows statistically significant improvements ($p<0.01$ for all four metrics) in all the metrics in comparison to the best baseline model (and all previous works) under this multi-task setting. Note that in our initial experiments, our entailment generation model helped the video captioning task significantly more than the alternative approach of simply improving fluency by adding (or deep-fusing) an external language model (or pre-trained word embeddings) to the decoder (using both in-domain and out-of-domain language models), again because a caption is also ‘entailed’ by a video in a logically-directed sense and hence this matches our captioning task better (also see results of Venugopalan et al. (2016) in Table 1).

Combining the above one-to-many and many-to-one multi-task learning models, our full model is the 3-task, many-to-many model (Fig. 4), where both the video encoder and the language decoder of the video captioning model are shared (and hence improved) with those of the unsupervised video prediction and entailment generation models, respectively. (We found the setting with unshared attention parameters to work best, likely because video captioning and video prediction prefer very different alignment distributions.) A mixing ratio of 100:100:50 alternate mini-batches of video captioning, unsupervised video prediction, and entailment generation, respectively, works well. Table 1 shows that our many-to-many multi-task model again outperforms our strongest baseline (with statistical significance of $p<0.01$ on all metrics), as well as all the previous state-of-the-art results by large absolute margins on all metrics. It also achieves significant improvements on some metrics over the one-to-many and many-to-one models. (The many-to-many model’s improvements have a statistical significance of $p<0.01$ on all metrics w.r.t. the baseline, $p<0.01$ on CIDEr-D w.r.t. both the one-to-many and many-to-one models, and $p<0.04$ on METEOR w.r.t. the one-to-many model.) Overall, we achieve the best results to date on YouTube2Text (MSVD) on all metrics.

5.2 Video Captioning on MSR-VTT, M-VAD

In Table 2, we also train and evaluate our final many-to-many multi-task model on two other video captioning datasets (using their standard splits; details in the supplementary). First, we evaluate on the new MSR-VTT dataset Xu et al. (2016). Since this is a recent dataset, we list previous works’ results as reported by the MSR-VTT dataset paper itself (in their updated supplementary at https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/cvpr16.supplementary.pdf). We improve over all of these significantly. Moreover, they maintain a leaderboard (http://ms-multimedia-challenge.com/leaderboard) on this dataset, and we also report the top 3 systems from it. Based on their ranking method, our multi-task model achieves the new rank 1 on this leaderboard. In Table 3, we further evaluate our model on the challenging movie-based M-VAD dataset, and again achieve improvements over all previous work Venugopalan et al. (2015a); Pan et al. (2016a); Yao et al. (2015). (Following previous work, we only use METEOR because M-VAD only has a single reference caption per video.)

5.3 Entailment Generation Results

Above, we showed that the new entailment generation task helps improve video captioning. Next, we show that the video captioning task also inversely helps the entailment generation task. Given a premise, the task of entailment generation is to generate an entailed hypothesis. We use only the entailment pairs subset of the SNLI corpus for this, but with a multi-reference split setup to allow automatic metric evaluation and a zero train-test premise overlap (see Sec. 4.1). All the hyperparameter details (again tuned on the validation set) are presented in the supplementary. Table 4 presents the entailment generation results for the baseline (sequence-to-sequence with attention, 3-ensemble, beam search) and the multi-task model which uses video captioning (shared decoder) on top of the baseline. A mixing ratio of 100:20 alternate mini-batches of entailment generation and video captioning (respectively) works well. (Note that this many-to-one model prefers a different mixing ratio and learning rate than the many-to-one model for improving video captioning (Sec. 5.1), because these hyperparameters depend on the primary task being improved, as also discussed in previous work Luong et al. (2016).) The multi-task model achieves statistically significant ($p<0.01$) improvements over the baseline on all metrics, thus demonstrating that video captioning and entailment generation mutually help each other.

5.4 Human Evaluation

In addition to the automated evaluation metrics, we present pilot-scale human evaluations on the YouTube2Text (Table 1) and entailment generation (Table 4) results. In each case, we compare our strongest baseline with our final multi-task model (M-to-M in case of video captioning and M-to-1 in case of entailment generation). We evaluate a random sample of 300 generated captions (or entailed hypotheses) from the test set, across three human evaluators. We remove the model identity to anonymize the two models, and ask the human evaluators to choose the better model based on relevance and coherence (described in Sec. 4.2). As shown in Table 5 and Table 6, the multi-task models are always better than the strongest baseline for both video captioning and entailment generation, on both relevance and coherence, and with similar improvements (2-7%) as the automatic metrics (shown in Table 1).

5.5 Analysis

Fig. 5 shows video captioning generation results on the YouTube2Text dataset, where our final M-to-M multi-task model is compared with our strongest attention-based baseline model for three categories of videos: (a) complex examples where the multi-task model performs better than the baseline; (b) ambiguous examples (i.e., the ground truth itself is confusing) where the multi-task model still correctly predicts one of the possible categories; and (c) complex examples where both models perform poorly. Overall, we find that the multi-task model generates captions that are better at both temporal action prediction and logical entailment (i.e., correct subset of full video premise) w.r.t. the ground truth captions. The supplementary also provides ablation examples of improvements by the 1-to-M video prediction based multi-task model alone, as well as by the M-to-1 entailment based multi-task model alone (over the baseline).

On analyzing the cases where the baseline is better than the final M-to-M multi-task model, we find that these are often scenarios where the multi-task model’s caption is also correct but the baseline caption is a bit more specific, e.g., “a man is holding a gun” vs “a man is shooting a gun”.

Finally, Table 7 presents output examples of our entailment generation multi-task model (Sec. 5.3), showing how the model accurately learns to produce logically implied subsets of the premise.

Conclusion

We presented a multimodal, multi-task learning approach to improve video captioning by incorporating temporally and logically directed knowledge via video prediction and entailment generation tasks. We achieve the best reported results (and rank) on three datasets, based on multiple automatic and human evaluations. We also show mutual multi-task improvements on the new entailment generation task. In future work, we are applying our entailment-based multi-task paradigm to other directed language generation tasks such as image captioning and document summarization.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This work was partially supported by a Google Faculty Research Award, an IBM Faculty Award, a Bloomberg Data Science Research Grant, and NVidia GPU awards.

Appendix A Experimental Setup

The Microsoft Research Video Description Corpus (MSVD) or YouTube2Text Chen and Dolan (2011) is used for our primary video captioning experiments. It has 1970 YouTube videos in the wild with many diverse captions in multiple languages for each video. Caption annotations for these videos are collected using Amazon Mechanical Turk (AMT). All our experiments use only English captions. On average, each video has 40 captions, and the overall dataset has about 80,000 unique video-caption pairs. The average clip duration is roughly 10 seconds. We used the standard split as stated in Venugopalan et al. (2015a), i.e., 1200 videos for training, 100 videos for validation, and 670 for testing.

MSR-VTT is a recent collection of 10,000 video clips of 41.2 hours duration (i.e., average duration of 15 seconds), which are annotated by AMT workers. It has 200,000 video clip-sentence pairs covering diverse content from a commercial video search engine. On average, each clip is annotated with 20 natural language captions. We used the standard split as provided in Xu et al. (2016), i.e., 6,513 video clips for training, 497 for validation, and 2,990 for testing.
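As a quick sanity check on the MSVD and MSR-VTT statistics quoted above, the following Python sketch recomputes the derived figures from the stated counts. The variable names are illustrative assumptions and are not part of the original setup.

```python
# Sanity-check the dataset statistics quoted in the text.
# All variable names here are illustrative assumptions.

# MSVD (YouTube2Text): standard split and approximate pair count.
msvd_split = {"train": 1200, "val": 100, "test": 670}
msvd_videos = sum(msvd_split.values())            # total number of videos
msvd_pairs = msvd_videos * 40                     # about 40 captions per video

# MSR-VTT: standard split, pair count, and average clip duration.
msrvtt_split = {"train": 6513, "val": 497, "test": 2990}
msrvtt_clips = sum(msrvtt_split.values())         # total number of clips
msrvtt_pairs = msrvtt_clips * 20                  # 20 captions per clip
msrvtt_avg_seconds = 41.2 * 3600 / msrvtt_clips   # hours -> seconds per clip

print(msvd_videos, msvd_pairs)
print(msrvtt_clips, msrvtt_pairs, round(msrvtt_avg_seconds, 1))
```

The MSVD product comes to 78,800 pairs, consistent with "about 80,000", and the MSR-VTT average comes to roughly 14.8 seconds, consistent with the rounded figure of 15 seconds.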

M-VAD is a movie description dataset with 49,000 video clips collected from 92 movies, with the average clip duration being 6 seconds. Alignment of descriptions to video clips is done through an automatic procedure using Descriptive Video Service (DVS) provided for the movies. Each video clip description has only 1 or 2 sentences, making most evaluation metrics (except paraphrase-based METEOR) infeasible. Again, we used the standard train/val/test split as provided in Torabi et al. (2015).

A.1.2 Video Prediction Dataset

For our unsupervised video representation learning task, we use the UCF-101 action videos dataset Soomro et al. (2012), which contains 13,320 video clips of 101 action categories and with an average clip length of 7.21 seconds each. This dataset suits our video captioning task well because both contain short video clips of a single action or few actions, and hence using future frame prediction on UCF-101 helps learn more robust and context-aware video representations for our short clip video captioning task. We use the standard split of 9,500 videos for training (we don’t need any validation set in our setup because we directly tune on the validation set of the video captioning task).

A.2 Pre-trained Visual Frame Features

For the three video captioning datasets (YouTube2Text, MSR-VTT, M-VAD) and the unsupervised video prediction dataset (UCF-101), we fix our sampling rate to 3 fps to bring uniformity in the temporal representation of actions across all videos. These sampled frames are then converted into features using several state-of-the-art models pre-trained on ImageNet Deng et al. (2009): VGGNet Simonyan and Zisserman (2015), GoogLeNet Szegedy et al. (2015); Ioffe and Szegedy (2015), and Inception-v4 Szegedy et al. (2016). For VGGNet, we use its fc7 layer features with dimension 4096. For GoogLeNet and Inception-v4, we use the layer before the fully connected layer with dimensions 1024 and 1536, respectively. We follow standard preprocessing: we convert all the natural language descriptions to lower case, tokenize the sentences, and remove punctuation.
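The fixed-rate sampling step can be illustrated with a minimal Python sketch. The function name `sample_frame_indices`, its parameters, and the floor-based index selection are assumptions made for illustration; the text above does not specify how frames were actually selected.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=3.0):
    """Pick frame indices so that roughly target_fps frames are kept per second.

    Hypothetical helper: the original pipeline's exact selection rule is not
    described, so this simply takes every (native_fps / target_fps)-th frame.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never step by less than one frame, even if the video is slower than target_fps.
    step = max(1.0, native_fps / target_fps)
    indices = []
    k = 0
    while True:
        idx = int(k * step)          # floor to the nearest earlier frame
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices


# Example: a 10-second clip recorded at 30 fps (300 frames).
indices = sample_frame_indices(num_frames=300, native_fps=30.0)
print(len(indices), indices[:4])
```

Under these assumptions, a 10-second clip yields 30 sampled frames, which fits within the 50 time steps to which the video LSTM-RNNs are unrolled.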

Appendix B Training Details

In all of our experiments, we tune all the model hyperparameters on the validation (development) set of the corresponding dataset. We consider the following small hyperparameter ranges and tune lightly over them: LSTM-RNN hidden state size in $\{256, 512, 1024\}$; learning rate in the range $[10^{-5}, 10^{-2}]$ with uniform intervals on a log-scale; weight initializations in the range $[-0.1, 0.1]$; and mixing ratios in the range $1:[0.01, 3]$ with uniform intervals on a log-scale. We use the following settings in all of our models (unless otherwise specified in a subsection below): we unroll video encoder/decoder LSTM-RNNs to 50 time steps and language encoder/decoder LSTM-RNNs to 30 time steps. We use a 1024-dimension LSTM-RNN hidden state size. We use 512-dimension vectors to embed frame-level visual features and word vectors. These embedding weights are learned during training. We use the Adam optimizer Kingma and Ba (2015) with default coefficients and a batch size of 32. We apply dropout with probability 0.5 to the vertical connections of the LSTM Zaremba et al. (2014) to reduce overfitting.
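To make "uniform intervals on a log-scale" concrete, the following Python sketch builds such a grid for the learning rate and mixing-ratio ranges given above. The helper name `log_uniform_grid` and the number of grid points are assumptions; the text does not state how many values were tried.

```python
import math


def log_uniform_grid(low, high, num_points):
    """Return num_points values from low to high, evenly spaced in log10."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (num_points - 1)
    return [10 ** (log_low + i * step) for i in range(num_points)]


# Grid sizes below are illustrative assumptions, not values from the paper.
learning_rates = log_uniform_grid(1e-5, 1e-2, num_points=4)
mixing_ratios = log_uniform_grid(0.01, 3.0, num_points=6)

print(learning_rates)
print(mixing_ratios)
```

With four points, the learning rate grid consists of successive powers of ten, so consecutive candidates differ by a constant factor rather than a constant difference.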

Our primary baseline model (Inception-v4, attention, ensemble) uses a learning rate of 0.0001 and initializes all its weights with a uniform distribution in the range $[-0.05, 0.05]$.

B.1.2 Multi-Task with Video Prediction (1-to-M)

In this model, the video captioning and unsupervised video prediction tasks share their encoder LSTM-RNN weights and image embeddings in a one-to-many multi-task setting. We again use a learning rate of 0.0001 and initialize all the learnable weights with a uniform distribution in the range $[-0.05, 0.05]$. Two important hyperparameters tuned (on the validation set of captioning datasets) are the ratio of encoder vs decoder frames for video prediction on UCF-101 (where we found that 80% of frames as input and 20% for prediction performs best); and the mini-batch mixing ratio between the captioning and video prediction tasks (where we found 100:200 works well).
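The encoder/decoder frame split for video prediction can be sketched as follows. The helper `split_frames` and the rounding rule are hypothetical; only the 80%/20% proportion comes from the text.

```python
def split_frames(frames, encoder_fraction=0.8):
    """Split a clip into encoder input frames and frames to be predicted.

    Hypothetical helper: the first encoder_fraction of the sequence is fed to
    the encoder, and the remainder is the decoder's prediction target.
    """
    if not 0.0 < encoder_fraction < 1.0:
        raise ValueError("encoder_fraction must be strictly between 0 and 1")
    cut = round(len(frames) * encoder_fraction)
    return frames[:cut], frames[cut:]


# Example with 50 placeholder frame features (integers stand in for vectors).
clip = list(range(50))
encoder_frames, target_frames = split_frames(clip)
print(len(encoder_frames), len(target_frames))
```

For a 50-frame sequence this gives 40 input frames and 10 frames to predict.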

B.1.3 Multi-Task with Entailment Generation (M-to-1)

In this model, the video captioning and entailment generation tasks share their language decoder LSTM-RNN weights and word embeddings in a many-to-one multi-task setting. We again use a learning rate of 0.0001. All the trainable weights are initialized with a uniform distribution in the range $[-0.08, 0.08]$. We observe that a mixing ratio of 100:50 alternating mini-batches (between the captioning and entailment generation tasks) works well here.

B.1.4 Multi-Task with Video and Entailment Generation (M-to-M)

In this many-to-many, three-task model, the video encoder is shared between the video captioning and unsupervised video prediction tasks, and the language decoder is shared between the video captioning and entailment generation tasks. We again use a learning rate of 0.0001. All the trainable weights are initialized with a uniform distribution in the range $[-0.08, 0.08]$. We found that a mixing ratio of 100:100:50 alternating mini-batches of video captioning, unsupervised video prediction, and entailment generation works best.
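One plausible reading of the 100:100:50 mixing ratio is that training cycles through the tasks, running the stated number of mini-batches for each before switching. The Python sketch below illustrates that schedule only; the function `task_schedule`, the task labels, and the choice of contiguous blocks (rather than a finer interleaving) are assumptions, and no model update is performed.

```python
from collections import Counter
from itertools import islice


def task_schedule(mixing_ratio):
    """Yield task names forever, in blocks sized by the mixing ratio.

    mixing_ratio maps a task name to the number of consecutive mini-batches
    trained on that task before switching to the next one (an assumption).
    """
    while True:
        for task, num_batches in mixing_ratio.items():
            for _ in range(num_batches):
                yield task


# Mixing ratio from the text; the task labels are illustrative.
ratio = {"captioning": 100, "video_prediction": 100, "entailment": 50}

# Take exactly one full round of the schedule and count batches per task.
one_round = list(islice(task_schedule(ratio), sum(ratio.values())))
counts = Counter(one_round)
print(counts)
```

In a training loop, each yielded task name would select which dataset to draw the next mini-batch from and which encoder-decoder pair to update, with the shared parameters receiving gradients from every task that uses them.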

B.2 Video Captioning on MSR-VTT

We also evaluate our many-to-many multi-task model on other video captioning datasets. For MSR-VTT, we train the model again using a learning rate of 0.0001. All the trainable weights are initialized with a uniform distribution in the range $[-0.05, 0.05]$. We found that a mixing ratio of 100:20:20 alternating mini-batches of video captioning, unsupervised video prediction, and entailment generation works best.

B.3 Video Captioning on M-VAD

For the M-VAD dataset, we use 512-dimension hidden vectors for the LSTMs to reduce overfitting. We initialize the LSTM weights with a uniform distribution in the range $[-0.1, 0.1]$ and all other weights with a uniform distribution in the range $[-0.05, 0.05]$. We use a learning rate of 0.001. We found that a mixing ratio of 100:5:5 alternating mini-batches of video captioning, unsupervised video prediction, and entailment generation works best.

B.4 Entailment Generation

Here, we use video captioning to in turn help improve entailment generation results. We use the same hyperparameters for both the baseline and the multi-task model (Sec. 5.3 and Table 4). We use a learning rate of 0.001. All the trainable weights are initialized with a uniform distribution in the range $[-0.08, 0.08]$. We found that a mixing ratio of 100:20 alternating mini-batches of entailment generation and video captioning performs best.

Appendix C Analysis

In Sec. 5.5 of the main paper, we discussed examples comparing the generated captions of the final many-to-many multi-task model with those of the baseline. Here, we also separately compare our one-to-many (video prediction based) and many-to-one (entailment generation based) multi-task models with the baseline. As shown in Table 8, our one-to-many multi-task model better identifies the actions and objects in comparison to the baseline, because the video prediction task helps it learn better context-aware visual representations, e.g., “a man is eating something” vs. “a man is drinking something” and “a woman is slicing a vegetable” vs. “a woman is slicing an onion”.

On the other hand, the many-to-one multi-task model (with entailment generation) seems to be stronger at generating a caption which is a logically-implied entailment of a ground-truth caption, e.g., “a cat is playing with a cat” vs. “a cat is playing” and “a woman is talking” vs. “a woman is doing makeup” (see Table 8).

References