Sort Story: Sorting Jumbled Images and Captions into Stories

Harsh Agrawal, Arjun Chandrasekaran, Dhruv Batra, Devi Parikh, Mohit Bansal

Introduction

Sequencing is a task for children that is aimed at improving understanding of the temporal occurrence of a sequence of events. The task is, given a jumbled set of images (and maybe captions) that belong to a single story, sort them into the correct order so that they form a coherent story. Our motivation in this work is to enable AI systems to better understand and predict the temporal nature of events in the world. To this end, we train machine learning models to perform the task of “sequencing”.

Temporal reasoning has a number of applications such as multi-document summarization of multiple sources of, say, news information where the relative order of events can be useful to accurately merge information in a temporally consistent manner. In question answering tasks [Richardson et al., 2013, Fader et al., 2014, Weston et al., 2015, Ren et al., 2015], answering questions related to when an event occurs, or what events occurred prior to a particular event require temporal reasoning. A good temporal model of events in everyday life, i.e., a “temporal common sense”, could also improve the quality of communication between AI systems and humans.

Stories are a form of narrative sequences that have an inherent temporal common sense structure. We propose the use of visual stories depicting personal events to learn temporal common sense. We use stories from the Sequential Image Narrative Dataset (SIND) [Huang et al., 2016] in which a set of 5 aligned image-caption pairs together form a coherent story. Given an input story that is jumbled (Fig. 1(a)), we train machine learning models to sort them into a coherent story (Fig. 1(b)).Note that ‘jumbled’ here refers to the loss of temporal ordering; image-caption pairs are still aligned.

Our contributions are as follows: – We propose the task of visual story sequencing. – We implement two approaches to solve the task: one based on individual story elements to predict position, and the other based on pairwise story elements to predict relative order of story elements. We also combine these approaches in a voting scheme that outperforms the individual methods. – As features, we represent a story element as both text-based features from the caption and image-based features, and show that they provide complementary improvements. For text-based features, we use both sentence context and relative order based distributed representations. – We show qualitative examples of our models learning temporal common sense.

Related Work

Temporal ordering has a rich history in NLP research. Scripts [Schank and Abelson, 2013], and more recently, narrative chains [Chambers and Jurafsky, 2008] contain information about the participants and causal relationships between events that enable the understanding of stories. A number of works [Mani and Schiffman, 2005, Mani et al., 2006, Boguraev and Ando, 2005] learn temporal relations and properties of news events from the dense, expert-annotated TimeBank corpus [Pustejovsky et al., 2003]. In our work, however, we use multi-modal story data that has no temporal annotations.

A number of works also reason about temporal ordering by using manually defined linguistic cues [Webber, 1988, Passonneau, 1988, Lapata and Lascarides, 2006, Hitzeman et al., 1995, Kehler, 2000]. Our approach uses neural networks to avoid feature design for learning temporal ordering.

Recent works [Modi and Titov, 2014, Modi, 2016] learn distributed representations for predicates in a sentence for the tasks of event ordering and cloze evaluation. Unlike their work, our approach makes use of multi-modal data with free-form natural language text to learn event embeddings. Further, our models are trained end-to-end while their pipelined approach involves parsing and extracting verb frames from each sentence, where errors may propagate from one module to the next (as discussed in Section 4.3).

?) use a generalized Mallows model for modeling sequences for coherence within single documents. Their approach may also be applicable to our task. Recently, ?) presented the “ROCStories” dataset of 5-sentence stories with stereotypical causal and temporal relations between events. In our work though, we make use of a multi-modal story-dataset that contains both images and associated story-like captions.

Some works in vision [Pickup et al., 2014, Basha et al., 2012] also temporally order images; typically by finding correspondences between multiple images of the same scene using geometry-based approaches. Similarly, ?) compose a story out of multiple short video clips. They define metrics based on scene dynamics and coherence, and use dense optical flow and patch-matching. In contrast, our work deals with stories containing potentially visually dissimilar but semantically coherent set of images and captions.

A few other recent works [Kim et al., 2015, Kim et al., 2014, Kim and Xing, 2014, Sigurdsson et al., 2016, Bosselut et al., 2016, Wang et al., 2016] summarize hundreds of individual streams of information (images, text, videos) from the web that deal with a single concept or event, to learn a common theme or storyline or for timeline summarization. Our task, however, is to predict the correct sorting of a given story, which is different from summarization or retrieval. ?) attempt to learn temporal embeddings of video frames in complex events. While their motivation is similar to ours, they deal with sampled frames from a video while we attempt to learn temporal common sense from multi-modal stories consisting of a sequence of aligned image-caption pairs.

Approach

In this section, we first describe the two components in our approach: unary scores that do not use context, and pairwise scores that encode relative orderings of elements. Next, we describe how we combine these scores through a voting scheme.

Let $\sigma\in\Sigma_{n}$ denote a permutation of $n$ elements (image-caption pairs). We use $\sigma_{i}$ to denote the position of element $i$ in the permutation $\sigma$ . A unary score $S_{u}(\sigma)$ captures the appropriateness of each story element $i$ in position $\sigma_{i}$ :

where $P(\sigma_{i}|i)$ denotes the probability of the element $i$ being present in position $\sigma_{i}$ , which is the output from an $n$ -way softmax layer in a deep neural network. We experiment with 2 networks – (1) A language-alone unary model (Skip-Thought+MLP) that uses a Gated Recurrent Unit (GRU) proposed by ?) to embed a caption into a vector space. We use the Skip-Thought [Kiros et al., 2015] GRU, which is trained on the BookCorpus [Zhu et al., 2015] to predict the context (preceding and following sentences) of a given sentence. These embeddings are fed as input into a Multi-Layer Perceptron (MLP). (2) A language+vision unary model (Skip-Thought+CNN+MLP) that embeds the caption as above and embeds the image via a Convolutional Neural Network (CNN). We use the activations from the penultimate layer of the 19-layer VGG-net [Simonyan and Zisserman, 2014], which have been shown to generalize well. Both embeddings are concatenated and fed as input to an MLP.

In both cases, the best ordering of the story elements (optimal permutation) $\sigma^{*}=\operatorname*{arg\,max}_{\sigma\in\Sigma_{n}}S_{u}(\sigma)$ can be found efficiently in $O(n^{3})$ time with the Hungarian algorithm [Munkres, 1957]. Since these unary scores are not influenced by other elements in the story, they capture the semantics and linguistic structures associated with specific positions of stories e.g., the beginning, the middle, and the end.

2 Pairwise Models

At train time, the parameters of the LSTM are learned end-to-end to minimize this asymmetric ordered loss (as measured over the gold-standard sequences). At test time, we use $S([[\sigma_{i}<\sigma_{j}]]\mid i,j)=L_{ij}$ . Thus, as we move away from the origin in the embedding space, we traverse through the sentences in a story. Each of these three pairwise approaches assigns a score $S(\sigma_{i},\sigma_{j}|i,j)$ to an ordered pair of elements (i,j), which is used to construct a pairwise scoring model:

by summing over the scores for all possible ordered pairs in the permutation. This pairwise score captures local contextual information in stories. Finding the best permutation $\sigma^{*}=\operatorname*{arg\,max}_{\sigma\in\Sigma_{n}}S_{p}(\sigma)$ under this pairwise model is NP-hard so approximations will be required. In our experiments, we study short sequences ( $n=5$ ), where the space of permutations is easily enumerable ( $5!=120$ ). For longer sequences, we can utilize integer programming methods or well-studied spectral relaxations for this problem.

3 Voting-based Ensemble

To combine the complementary information captured by the unary ( $S_{u}$ ) and pairwise models ( $S_{p}$ ), we use a voting-based ensemble. For each method in the ensemble, we find the top three permutations. Each of these permutations $(\sigma^{k})$ then vote for a particular element to be placed at a particular position. Let $V$ be a vote matrix such that $V_{ij}$ stores the number of votes for $i^{th}$ element to occur at $j^{th}$ position, i.e. $V_{ij}=\sum_{k}[[\sigma^{k}_{i}==j]])$ . We use the Hungarian algorithm to find the optimal permutation that maximizes the votes assigned, i.e. $\sigma^{*}_{\text{vote}}=\operatorname*{arg\,max}_{\sigma\in\Sigma_{n}}\sum_{i=1}^{n}\sum_{j=1}^{n}V_{ij}\cdot[[\sigma_{i}==j]]$ . We experimented with a number of model voting combinations and found the combination of pairwise Skip-Thought+CNN+MLP and neural position embeddings to work best (based on a validation set).

Experiments

We train and evaluate our model on personal multi-modal stories from the SIND (Sequential Image Narrative Dataset) [Huang et al., 2016], where each story is a sequence of 5 images and corresponding story-like captions. The narrative captions in this dataset, e.g., “friends having a good time” (as opposed to “people sitting next to each other”) capture a sequential, conversational language, which is characteristic of stories. We use 40,155 stories for training, 4990 for validation and 5055 stories for testing.

2 Metrics

We evaluate the performance of our model at correctly ordering a jumbled set of story elements using the following 3 metrics: Spearman’s rank correlation (Sp.) [Spearman, 1904] measures if the ranking of story elements in the predicted and ground truth orders are monotonically related (higher is better). Pairwise accuracy (Pairw.) measures the fraction of pairs of elements whose predicted relative ordering is the same as the ground truth order (higher is better). Average Distance (Dist.) measures the average change in position of all elements in the predicted story from their respective positions in the ground truth story (lower is better).

3 Results

As shown in Table 1, the pairwise models based on Skip-Thought features outperform the unary models in our task. However, the Pairwise Order Model performs worse than the unary Skip-Thought model, suggesting that the Skip-Thought features, which encode context of a sentence, also provide a crucial signal for temporal ordering of story sentences.

Contribution of Image Features

Augmenting the text features with image features results in a visible performance improvement of both the model trained with unary features and the model trained with pairwise features. While image features by themselves result in poor performance on this task, they seem to capture temporal information that is complementary to the text features.

Ensemble Voting

To exploit the fact that unary and pairwise models, as well as text and image features, capture different aspects of the story, we combine them using a voting ensemble. Based on the validation set, we found that combining the Pairwise Order model and the Pairwise model with both Skip-Thought and Image (CNN) features performs the best. This voting based method achieves the best performance on all three metrics. This shows that our different approaches indeed capture complementary information regarding feasible orderings of caption-image pairs to form a coherent story.

For comparison to existing related work, we tried to duplicate the pipelined approach of ?). For this, we first parse our story sentences to extract SVO (subject, verb, object) tuples (using the Stanford Parser [Chen and Manning, 2014]). However, this step succeeds for only 60% of our test data. Now even if we consider a perfect downstream algorithm that always makes the correct position prediction given SVO tuples, the overall performance is still a Spearman correlation of just 0.473, i.e., the upper bound performance of this pipelined approach is lower than the performance of our text-only end-to-end model (correlation of 0.546) in Table 1.

4 Qualitative Analysis

Visualizations of position predictions from our model demonstrate that it has learnt the three act structure [Trottier, 1998] in stories – the setup, the middle and the climax. We also present success and failure examples of our sorting model’s predictions. See the supplementary for more details and figures.

We visualize our model’s temporal common sense, in Fig. 2. The word clouds show discriminative words – the words that the model believes are indicative of sentence positions in a story. The size of a word is proportional to the ratio of its frequency of occurring in that position to other positions. Some words like ‘party’, ‘wedding’, etc., probably because our model believes that the start the story describes the setup – the occasion or event. People often tend to describe meeting friends or family members which probably results in the discriminative words such as ‘people’, ‘friend’, ‘everyone’ in the second and the third sentences. Moreover, the model believes that people tend to conclude the stories using words like ‘finally’, ‘afterwards’, tend to talk about ‘great day’, group ‘pictures’ with everyone, etc.

Conclusion

We propose the task of “sequencing” in a set of image-caption pairs, with the motivation of learning temporal common sense. We implement multiple neural network models based on individual and pairwise element-based predictions (and their ensemble), and utilize both image and text features, to achieve strong performance on the task. Our best system, on average, predicts the ordering of sentences to within a distance error of 0.8 (out of 5) positions. We also analyze our predictions and show qualitative examples that demonstrate temporal common sense.

Acknowledgements

We thank Ramakrishna Vedantam and the anonymous reviewers for their helpful suggestions. This work was supported by: NSF CAREER awards to DB and DP, ARO YIP awards to DB and DP, ICTAS Junior Faculty awards to DB and DP, Google Faculty Research award to DP and DB, ARL grant W911NF-15-2-0080 to DP and DB, ONR grant N00014-14-1-0679 to DB and N00014-16-1-2713 to DP, ONR YIP award to DP, Paul G. Allen Family Foundation Allen Distinguished Investigator award to DP, Alfred P. Sloan Fellowship to DP, AWS in Education Research grant to DB, NVIDIA GPU donations to DB and MB, an IBM Faculty Award and Bloomberg Data Science Research Grant to MB. Appendix

Appendix A Confusion Matrix for Predicting Position of an Element

We visualize the 5-way classification confusion matrix for our best performing method i.e., Voting ensemble of Pairwise Skip-Thought+Image(CNN) and Pairwise Order (Neural Position Embedding (NPE)) in Fig. 3. The block-diagonal matrix structure shows that the model predicts the first and the last element of a story reasonably well but is often confused by elements in the middle of the story. This visualization suggests that the model has learnt the three act structure in stories, i.e., the setup, the middle and the climax.

Appendix B Predicted Stories

We present qualitative examples of story orders predicted by the best performing model in Fig. 4. Fig. 4(a) shows example stories in which the position of all elements are predicted correctly. Fig. 4(b) shows stories in which none of the positions are predicted correctly by our model. These two examples show that our model clearly fails when there is no inherent temporal order in the story either via language or images.

Appendix C Temporal Common Sense

In the word cloud in Fig. 5, we visualize the words that the model finds discriminative in correct predictions. These are words from correctly predicted stories that the model believes are indicative of sentence positions in a story. The size of a word is proportional to the ratio of its frequency of occurring in that position to other positions. Our model captures events such as ‘carnival’, ‘reunion’, and sports topics like ‘baseball’, ‘soccer’, ‘skate’ in the first position. This could be the case because the first sentence of a story usually introduces the event that the story is based on. In Fig. 5(e) (word-cloud of the last sentence), we also observe that the model correctly learns cue-words such as ‘overall’, and ‘lastly’. It also learns words and events that frequently conclude stories such as ‘returned’, ‘tired’, ‘winning’, ‘winner’, and ‘celebration’.