Localizing Moments in Video with Natural Language

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

Introduction

Consider the video depicted in Figure LABEL:fig:ConceptFigure, in which a little girl jumps around, falls down, and then gets back up to start jumping again. Suppose we want to refer to a particular temporal segment, or moment, from the video, such as when the girl resiliently begins jumping again after she has fallen. Simply referring to the moment via an action, object, or attribute keyword may not uniquely identify it. For example, important objects in the scene, such as the girl, are present in each frame. Likewise, recognizing all the frames in which the girl is jumping will not localize the moment of interest as the girl jumps both before and after she has fallen. Rather than being defined by a single object or activity, the moment may be defined by when and how specific actions take place in relation to other actions. An intuitive way to refer to the moment is via a natural language phrase, such as “the little girl jumps back up after falling”.

Motivated by this example, we consider localizing moments in video with natural language. Specifically, given a video and text description, we identify start and end points in the video which correspond to the given text description. This is a challenging task requiring both language and video understanding, with important applications in video retrieval, such as finding particular moments from a long personal holiday video, or desired B-roll stock video footage from a large video library (e.g., Adobe Stock [https://stock.adobe.com], Getty [http://www.gettyimages.com], Shutterstock [https://www.shutterstock.com]).

Existing methods for natural language based video retrieval retrieve an entire video given a text string but do not identify when a moment occurs within a video. To localize moments within a video we propose to learn a joint video-language model in which referring expressions and video features from corresponding moments are close in a shared embedding space. However, in contrast to whole video retrieval, we argue that in addition to video features from a specific moment, global video context and knowing when a moment occurs within a longer video are important cues for moment retrieval. For example, consider the text query “The man on the stage comes closest to the audience”. The term “closest” is relative and requires temporal context to properly comprehend. Additionally, the temporal position of a moment in a longer video can help localize the moment. For the text query “The biker starts the race”, we expect moments earlier in the video in which the biker is racing to be closer to the text query than moments at the end of the video. We thus propose the Moment Context Network (MCN) which includes a global video feature to provide temporal context and a temporal endpoint feature to indicate when a moment occurs in a video.

A major obstacle when training our model is that current video-language datasets do not include natural language which can uniquely localize a moment. Additionally, datasets like are small and restricted to specific domains, such as dash-cam or cooking videos, while datasets sourced from movies and YouTube are frequently edited and tend to only include entertaining moments (see for discussion). We believe the task of localizing moments with natural language is particularly interesting in unedited videos which tend to include uneventful video segments that would generally be cut from edited videos. Consequently, we desire a dataset which consists of distinct moments from unedited video footage paired with descriptions which can uniquely localize each moment, analogous to datasets that pair distinct image regions with descriptions .

To address this problem, we collect the Distinct Describable Moments (DiDeMo) dataset which includes distinct video moments paired with descriptions which uniquely localize the moment in the video. Our dataset consists of over 10,000 unedited videos with 3-5 pairs of descriptions and distinct moments per video. DiDeMo is collected in an open-world setting and includes diverse content such as pets, concerts, and sports games. To ensure that descriptions are referring and thus uniquely localize a moment, we include a validation step inspired by .

Contributions. We consider the problem of localizing moments in video with natural language in a challenging open-world setting. We propose the Moment Context Network (MCN) which relies on local and global video features. To train and evaluate our model, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 40,000 pairs of referring descriptions and localized moments in unedited videos.

Related Work

Localizing moments in video with natural language is related to other vision tasks including video retrieval, video summarization, video description and question answering, and natural language object retrieval. Though large scale datasets have been collected for each of these tasks, none fit the specific requirements needed to learn how to localize moments in video with natural language.

Video Retrieval with Natural Language. Natural language video retrieval methods aim to retrieve a specific video given a natural language query. Current methods incorporate deep video-language embeddings similar to image-language embeddings proposed by . Our method also relies on a joint video-language embedding. However, to identify when events occur in a video, our video representation integrates local and global video features as well as temporal endpoint features which indicate when a candidate moment occurs within a video.

Some work has studied retrieving temporal segments within a video in constrained settings. For example, considers retrieving video clips from a home surveillance camera using text queries which include a fixed set of spatial prepositions (“across” and “through”) whereas considers retrieving temporal segments in 21 videos from a dashboard car camera. In a similar vein, consider aligning textual instructions to videos. However, methods aligning instructions to videos are restricted to structured videos as they constrain alignment by instruction ordering. In contrast, we consider localizing moments in an unconstrained open-world dataset with a wide array of visual concepts. To effectively train a moment localization model, we collect DiDeMo which is unique because it consists of paired video moments and referring expressions.

Video Summarization. Video summarization algorithms isolate temporal segments in a video which include important/interesting content. Though most summarization algorithms do not include textual input, some use text in the form of video titles or user queries in the form of category labels to guide content selection . collects textual descriptions for temporal video chunks as a means to evaluate summarization algorithms. However, these datasets do not include referring expressions and are limited in scope which makes them unsuitable for learning moment retrieval in an open-world setting.

Video Description and Question Answering (QA). Video description models learn to generate textual descriptions of videos given video-description pairs. Contemporary models integrate deep video representations with recurrent language models . Additionally, proposed a video QA dataset which includes question/answer pairs aligned to video shots, plot synopsis, and subtitles.

YouTube and movies are popular sources for joint video-language datasets. Video description datasets collected from YouTube include descriptions for short clips of longer YouTube videos . Other video description datasets include descriptions of short clips sourced from full length movies . However, though YouTube clips and movie shots are sourced from longer videos, they are not appropriate for localizing distinct moments in video for two reasons. First, descriptions about selected shots and clips are not guaranteed to be referring. For example, a short YouTube video clip might include a person talking and the description like “A woman is talking”. However, the entire video could consist of a woman talking and thus the description does not uniquely refer to the clip. Second, many YouTube videos and movies are edited, which means “boring” content which may be important to understand for applications like retrieving video segments from personal videos might not be present.

Natural Language Object Retrieval. Natural language object retrieval can be seen as an analogous task to ours, where natural language phrases are localized spatially in images, rather than temporally in videos. Despite similarities to natural language object retrieval, localizing video moments presents unique challenges. For example, it often requires comprehension of temporal indicators such as “first” as well as a better understanding of activities. Datasets for natural language object retrieval include referring expressions which can uniquely localize a specific location in an image. Descriptions in DiDeMo uniquely localize distinct moments and are thus also referring expressions.

Language Grounding in Images and Videos. tackle the task of object grounding in which sentence fragments in a description are localized to specific image regions. Work on language grounding in video is much more limited. Language grounding in video has focused on spatially grounding objects and actions in a video , or aligning textual phrases to temporal video segments . However, prior methods in both these areas severely constrain natural language vocabulary (e.g., only considers four objects and four verbs) and consider constrained visual domains in small datasets (e.g., 127 videos from a fixed laboratory kitchen and only includes 520 sentences). In contrast, DiDeMo offers a unique opportunity to study temporal language grounding in an open-world setting with a diverse set of objects, activities, and attributes.

Moment Context Network

Our moment retrieval model effectively localizes natural language queries in longer videos. Given input video frames $v=\{v_{t}\}$ , where $t\in\{0,\dots,T-1\}$ indexes time, and a proposed temporal interval, $\hat{\tau}=\tau_{start}:\tau_{end}$ , we extract visual temporal context features which encode the video moment by integrating both local features and global video context. Given a sentence $s$ we extract language features using an LSTM network. At test time our model optimizes the following objective

$$\hat{\tau}=\arg\min_{\tau}D_{\theta}(s,v,\tau)$$

where $D_{\theta}$ is a joint model over the sentence $s$ , video $v$ , and temporal interval $\tau$ given model parameters $\theta$ (Figure 1).

Visual Temporal Context Features. We encode video moments into visual temporal context features by integrating local video features, which reflect what occurs within a specific moment, global video features, which provide context for a video moment, and temporal endpoint features, which indicate when a moment occurs within a longer video. To construct local and global video features, we first extract high level video features using a deep convolutional network for each video frame, then average pool video features across a specific time span (similar to features employed by for video description and for whole video retrieval). Local features are constructed by pooling features within a specific moment and global features are constructed by averaging over all frames in a video.

When a moment occurs in a video can indicate whether or not a moment matches a specific query. To illustrate, consider the query “the bikers start the race.” We expect moments closer to the beginning of a video in which bikers are racing to be more similar to the description than moments at the end of the video in which bikers are racing. To encode this temporal information, we include temporal endpoint features which indicate the start and endpoint of a candidate moment (normalized to the interval $[0,1]$). We note that our global video features and temporal endpoint features are analogous to global image features and spatial context features frequently used in natural language object retrieval .
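As a minimal sketch of how the three components could be assembled for one candidate moment, the snippet below average-pools precomputed per-frame features and appends the endpoints. The function names, the frame-index representation of a moment, the concatenation order, and the normalization of endpoints by video length to lie in [0, 1] are illustrative assumptions rather than the exact implementation.

```python
def mean_pool(frames):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def temporal_context_input(frame_feats, start, end):
    """Build the input vector for one candidate moment (hypothetical helper).

    frame_feats: per-frame feature vectors for the whole video.
    start, end:  frame indices of the moment (end is exclusive).
    """
    total = len(frame_feats)
    local_feat = mean_pool(frame_feats[start:end])  # what occurs within the moment
    global_feat = mean_pool(frame_feats)            # context from the whole video
    tef = [start / total, end / total]              # endpoints, assumed scaled to [0, 1]
    return local_feat + global_feat + tef


# Toy video with four frames and two-dimensional frame features.
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
feat = temporal_context_input(frames, start=2, end=4)
print(feat)
```

Keeping the endpoints as separate input dimensions lets a downstream network learn position-dependent cues such as "start" or "first" without changing the pooled content features.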

Localizing video moments often requires localizing specific activities (like “jump” or “run”). Therefore, we explore two sources of visual input modalities; appearance or RGB frames ( ${v_{t}}$ ) and optical flow frames ( ${f_{t}}$ ). We extract $fc_{7}$ features from RGB frames using VGG pre-trained on ImageNet . We expect these features to accurately identify specific objects and attributes in video frames. Likewise, we extract optical flow features from the penultimate layer from a competitive activity recognition model . We expect these features to help localize moments which require understanding action.

Temporal context features are extracted by inputting local video features, global video features, and temporal endpoint features into a two layer neural network with ReLU nonlinearities (Figure 1 top). Separate weights are learned when extracting temporal context features for RGB frames (denoted as $P_{\theta}^{V}$ ) and optical flow frames (denoted as $P_{\theta}^{F}$ ).
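The following sketch illustrates a two-layer network of this kind in plain Python. The layer sizes, the weights, and the choice to apply the ReLU only after the first layer are assumptions made for illustration. In practice one such parameter set would be learned for RGB input and a separate one for optical flow input.

```python
def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, xi) for xi in x]


def embed_moment(x, params):
    """Two-layer embedding; ReLU placement is an assumption."""
    hidden = relu(linear(x, params["w1"], params["b1"]))
    return linear(hidden, params["w2"], params["b2"])


# Hypothetical hand-set weights; a real model would learn these.
rgb_params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [[2.0, 3.0]],
    "b2": [0.5],
}
embedding = embed_moment([1.0, -2.0], rgb_params)
print(embedding)
```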

Language Features. To capture language structure, we extract language features using a recurrent network (specifically an LSTM ). After encoding a sentence with an LSTM, we pass the last hidden state of the LSTM through a single fully-connected layer to yield embedded feature $P_{\theta}^{L}$ . Though our dataset contains over 40,000 sentences, it is still small in comparison to datasets used for natural language object retrieval (e.g., ). Therefore, we find that representing words with dense word embeddings (specifically Glove ) as opposed to one-hot encodings yields superior results when training our LSTM.

Joint Video and Language Model. Our joint model is the sum of squared distances between embedded appearance, flow, and language features

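$$D_{\theta}(s,v,\tau)=\lVert P_{\theta}^{V}(v,\tau)-P_{\theta}^{L}(s)\rVert_{2}^{2}+\eta\,\lVert P_{\theta}^{F}(f,\tau)-P_{\theta}^{L}(s)\rVert_{2}^{2}$$

where $\eta$ is a tunable (via cross validation) “late fusion” scalar parameter. $\eta$ was set to $2.33$ via ablation studies.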
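To make the late fusion concrete, the sketch below scores each candidate moment by a sum of squared distances to the language embedding and returns the closest one. It assumes that $\eta$ weights the flow term and that embeddings are already computed. The dictionary layout and function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fused_distance(rgb_emb, flow_emb, lang_emb, eta=2.33):
    """Late fusion of RGB and flow distances (eta assumed to weight flow)."""
    return (squared_distance(rgb_emb, lang_emb)
            + eta * squared_distance(flow_emb, lang_emb))


def localize(candidates, lang_emb, eta=2.33):
    """Return the candidate interval with the smallest fused distance.

    candidates maps (start, end) -> (rgb_embedding, flow_embedding).
    """
    return min(
        candidates,
        key=lambda tau: fused_distance(candidates[tau][0], candidates[tau][1],
                                       lang_emb, eta),
    )


lang = [1.0, 0.0]
candidates = {
    (0, 0): ([0.0, 1.0], [0.0, 1.0]),
    (1, 2): ([1.0, 0.0], [1.0, 0.5]),
}
best = localize(candidates, lang)
print(best)
```

Because the candidate set for a short video is small, an exhaustive minimum over all intervals is sufficient at test time.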

Ranking Loss for Moment Retrieval. We train our model with a ranking loss which encourages referring expressions to be closer to corresponding moments than negative moments in a shared embedding space. Negative moments used during training can either come from different segments within the same video (intra-video negative moments) or from different videos (inter-video negative moments). Revisiting the video depicted in Figure LABEL:fig:ConceptFigure, given a phrase “the little girl jumps back up after falling” many intra-video negative moments include concepts mentioned in the phrase such as “little girl” or “jumps”. Consequently, our model must learn to distinguish between subtle differences within a video. By comparing the positive moment to the intra-video negative moments, our model can learn that localizing the moment corresponding to “the little girl jumps back up after falling” requires more than just recognizing an object (the girl) or an action (jumps). For training example $i$ with endpoints $\tau_{i}$ , we define the following intra-video ranking loss

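$$\mathcal{L}_{i}^{intra}(\theta)=\sum_{n\in\Gamma\setminus\{\tau_{i}\}}\mathcal{L}^{R}\big(D_{\theta}(s^{i},v^{i},\tau_{i}),\,D_{\theta}(s^{i},v^{i},n)\big)$$

where $s^{i}$ and $v^{i}$ are the sentence and video of training example $i$ , $\mathcal{L}^{R}(x,y)=\max(0,x-y+b)$ is the ranking loss, $\Gamma$ are all possible temporal video intervals, and $b$ is a margin. Intuitively, this loss encourages text queries to be closer to a corresponding video moment than all other possible moments from the same video.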
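A small sketch of the intra-video term, assuming the model distances for every candidate interval of one video have already been computed. The margin value and the dictionary layout are hypothetical choices for illustration.

```python
def ranking_loss(pos_dist, neg_dist, margin):
    """max(0, x - y + b): zero once the negative is farther by the margin."""
    return max(0.0, pos_dist - neg_dist + margin)


def intra_video_loss(distances, positive, margin=0.1):
    """Sum the ranking loss over all other intervals of the same video.

    distances maps each candidate interval to its model distance
    for the query; positive is the ground-truth interval.
    """
    pos = distances[positive]
    return sum(
        ranking_loss(pos, dist, margin)
        for interval, dist in distances.items()
        if interval != positive
    )


distances = {(0, 0): 0.2, (1, 1): 0.25, (2, 2): 1.0}
loss = intra_video_loss(distances, positive=(0, 0))
print(loss)
```

In this toy case only the interval at distance 0.25 violates the margin, so it alone contributes to the loss.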

Only comparing moments within a single video means the model must learn to differentiate between subtle differences without learning how to differentiate between broader semantic concepts (e.g., “girl” vs. “sofa”). Hence, we also compare positive moments to inter-video negative moments which generally include substantially different semantic content. When selecting inter-video negative moments, we choose negative moments which have the same start and end points as positive moments. This encourages the model to differentiate between moments based on semantic content, as opposed to when the moment occurs in the video. During training we do not verify that inter-video negatives are indeed true negatives. However, the language in our dataset is diverse enough that, in practice, we observe that randomly sampled inter-video negatives are generally true negatives. For training example $i$ , we define the following inter-video ranking loss

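$$\mathcal{L}_{i}^{inter}(\theta)=\sum_{j\neq i}\mathcal{L}^{R}\big(D_{\theta}(s^{i},v^{i},\tau_{i}),\,D_{\theta}(s^{i},v^{j},\tau_{i})\big)$$

This loss encourages text queries to be closer to corresponding video moments than moments outside the video, and should thus learn to differentiate between broad semantic concepts. Our final inter-intra video ranking loss is

$$\mathcal{L}(\theta)=\lambda\sum_{i}\mathcal{L}_{i}^{intra}(\theta)+(1-\lambda)\sum_{i}\mathcal{L}_{i}^{inter}(\theta)$$

where $\lambda$ is a weighting parameter chosen through cross-validation.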
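The sketch below shows the inter-video term and one way the two terms could be weighted. The convex combination with weights $\lambda$ and $1-\lambda$, the margin value, and all function names are assumptions for illustration. Negative distances are taken from the same interval in other videos, as described above.

```python
def ranking_loss(pos_dist, neg_dist, margin):
    return max(0.0, pos_dist - neg_dist + margin)


def inter_video_loss(pos_dist, other_video_dists, margin=0.1):
    """Compare the positive moment with the same interval in other videos."""
    return sum(ranking_loss(pos_dist, d, margin) for d in other_video_dists)


def combined_loss(intra, inter, lam):
    """Weighted sum of both terms (convex combination is an assumption)."""
    return lam * intra + (1.0 - lam) * inter


inter = inter_video_loss(0.5, [0.4, 2.0])
total = combined_loss(intra=0.05, inter=inter, lam=0.5)
print(inter, total)
```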

The DiDeMo Dataset

A major challenge when designing algorithms to localize moments with natural language is that there is a dearth of large-scale datasets which consist of referring expressions and localized video moments. To mitigate this issue, we introduce the Distinct Describable Moments (DiDeMo) dataset which includes over 10,000 personal videos, each 25-30 seconds long, with over 40,000 localized text descriptions. Example annotations are shown in Figure 2.

To ensure that each description is paired with a single distinct moment, we collect our dataset in two phases (similar to how collected text to localize image regions). First, we asked annotators to watch a video, select a moment, and describe the moment such that another user would select the same moment based on the description. Then, descriptions collected in the first phase are validated by asking annotators to watch videos and mark moments that correspond to collected descriptions.

Harvesting Personal Videos. We randomly select over 14,000 videos from YFCC100M which contains over 100,000 Flickr videos with a Creative Commons License. To ensure harvested videos are unedited, we run each video through a shot detector based on the difference of color histograms in adjacent frames then manually filter videos which are not caught. Videos in DiDeMo represent a diverse set of real-world videos, which include interesting, distinct moments, as well as uneventful segments which might be excluded from edited videos.

Video Interface. Localizing text annotations in video is difficult because the task can be ambiguous and users must digest a 25-30s video before scrubbing through the video to mark start and end points. To illustrate the inherent ambiguity of our task, consider the phrase “The woman leaves the room.” Some annotators may believe this moment begins as soon as the woman turns towards the exit, whereas others may believe the moment starts as the woman’s foot first crosses the door threshold. Both annotations are valid, but result in large discrepancies between start and end points.

To make our task less ambiguous and speed up annotation, we develop a user interface in which videos are presented as a timeline of temporal segments. Each segment is displayed as a gif, which plays at 2x speed when the mouse is hovered over it. Following , who collected localized text annotations for summarization datasets, we segment our videos into 5-second segments. Users select a moment by clicking on all segments which contain the moment. To validate our interface, we ask five users to localize moments in ten videos using our tool and a traditional video scrubbing tool. Annotations with our gif-based tool are faster to collect (25.66s vs. 38.48s). Additionally, start and end points marked using the two different tools are similar. The standard deviation for start and end points marked when using the video scrubbing tool (2.49s) is larger than the average difference in start and end points marked using the two different tools (2.45s).
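To illustrate the size of the resulting annotation space, the sketch below splits a video into 5-second segments and enumerates every contiguous run of segments a user could select. Counting a final partial segment as its own gif is an assumption, and the function names are hypothetical.

```python
import math


def num_segments(duration_s, segment_s=5):
    """Number of gifs for a video; a trailing partial segment counts as one."""
    return math.ceil(duration_s / segment_s)


def candidate_moments(n_segments):
    """All contiguous runs of segments as inclusive (start, end) index pairs."""
    return [(start, end)
            for start in range(n_segments)
            for end in range(start, n_segments)]


n = num_segments(30)
moments = candidate_moments(n)
print(n, len(moments))
```

With $n$ segments there are $n(n+1)/2$ contiguous selections, so a 30-second video with six segments has 21 possible moments.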

Moment Validation. After annotators describe a moment, we ask three additional annotators to localize the moment given the text annotation and the same video. To accept a moment description, we require that at least three out of four annotators (one describer and three validators) be in agreement. We consider two annotators to agree if one of the start or end point differs by at most one gif.
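The acceptance rule can be sketched as follows under one possible reading: two annotations agree when at most one endpoint differs, and by at most one gif, and a description is accepted when some group of three annotators agrees pairwise. Both readings, along with the function names, are assumptions made for illustration.

```python
from itertools import combinations


def agree(a, b):
    """a, b are (start_gif, end_gif); at most one endpoint is off by one gif."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) <= 1


def accept(annotations, min_agreeing=3):
    """True if some group of min_agreeing annotations agrees pairwise."""
    for group in combinations(annotations, min_agreeing):
        if all(agree(x, y) for x, y in combinations(group, 2)):
            return True
    return False


# One describer followed by three validators.
print(accept([(1, 1), (1, 1), (1, 2), (4, 5)]))
print(accept([(0, 0), (1, 1), (3, 3), (5, 5)]))
```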

DiDeMo Summary

Table 1 compares our Distinct Describable Moments (DiDeMo) dataset to other video-language datasets. Though some datasets include temporal localization of natural language, these datasets do not include a verification step to ensure that descriptions refer to a single moment. In contrast, our verification step ensures that descriptions in DiDeMo are referring expressions, meaning that they refer to a specific moment in a video.

Vocabulary. Because videos are curated from Flickr, DiDeMo reflects the type of content people are interested in recording and sharing. Consequently, DiDeMo is human-centric with words like “baby”, “woman”, and “man” appearing frequently. Since videos are randomly sampled, DiDeMo has a long tail with words like “parachute” and “violin” appearing infrequently (28 and 38 times).

Important, distinct moments in a video often coincide with specific camera movements. For example, “the camera pans to a group of friends” or “zooms in on the baby” can describe distinct moments. Many moments in personal videos are easiest to describe in reference to the viewer (e.g., “the little boy runs towards the camera”). In contrast to other dataset collection efforts , we allow annotations to reference the camera, and believe such annotations may be helpful for applications like text-assisted video editing.

Table 2 contrasts the kinds of words used in DiDeMo to two natural language object retrieval datasets and two video description datasets . The three left columns report the percentage of sentences which include camera words (e.g., “zoom”, “pan”, “cameraman”), temporal indicators (e.g., “after” and “first”), and spatial indicators (e.g., “left” and “bottom”). We also compare how many words belong to certain parts of speech (verb, noun, and adjective) using the natural language toolkit part-of-speech tagger . DiDeMo contains more sentences with temporal indicators than natural language object retrieval and video description datasets, as well as a large number of spatial indicators. DiDeMo has a higher percentage of verbs than natural language object retrieval datasets, suggesting understanding action is important for moment localization in video.

Annotated Time Points. Annotated segments can be any contiguous set of gifs. Annotators generally describe short moments with 72.34% of descriptions corresponding to a single gif and 22.26% corresponding to two contiguous gifs. More annotated moments occur at the beginning of a video than the end. This is unsurprising as people generally choose to begin filming a video when something interesting is about to happen. In 86% of videos annotators described multiple distinct moments with an average of 2.57 distinct moments per video.

Evaluation

In this section we report qualitative and quantitative results on DiDeMo. First, we describe our evaluation criteria and then evaluate against baseline methods.

Qualitative Results. Figure 3 shows moments predicted by MCN. Our model is capable of localizing a diverse set of moments including moments which require understanding temporal indicators like “first” (Figure 3 top) as well as moments which include camera motion (Figure 3 middle). More qualitative results are in our appendix.

Fine-grained Moment Localization. Even though our ground truth moments correspond to five-second chunks, we can evaluate our model on smaller temporal segments at test time to predict moment locations with finer granularity. Instead of extracting features for a five second segment, we evaluate on individual frames extracted at $\sim$ 3 fps. Figure 4 includes an example in which two text queries (“A ball flies over the athletes” and “A man in a red hat passed a man in a yellow shirt”) are correctly localized by our model. The frames which best correspond to “A ball flies over the athletes” occur in the first few seconds of the video and the moment “A man in a red hat passed a man in a yellow shirt” finishes before the end point of the fifth segment. More qualitative results are in our appendix.

Discussion. We introduce the task of localizing moments in video with natural language in a challenging, open-world setting. Our Moment Context Network (MCN) localizes video moments by harnessing local video features, global video features, and temporal endpoint features. To train and evaluate natural language moment localization models, we collect DiDeMo, which consists of over 40,000 pairs of localized moments and referring expressions. Though MCN properly localizes many natural language queries in video, there are still many remaining challenges. For example, modeling complex (temporal) sentence structure is still very challenging (e.g., our model fails to localize “dog stops, then starts rolling around again”). Additionally, DiDeMo has a long-tail distribution with rare activities, nouns, and adjectives. More advanced (temporal) language reasoning and improving generalization to previously unseen vocabulary are two potential future directions.

TD was supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425, IIS-1212798, and the Berkeley AI Research (BAIR) Lab.

This appendix includes the following material:

Qualitative examples illustrating when global video features and tef features improve performance.

Qualitative examples contrasting RGB and flow input modalities.

Additional qualitative examples using the full Moment Context Network. See https://www.youtube.com/watch?v=MRO7_4ouNWU for a video example.

Results when training without a language feature.

List of words used to generate numbers in Table 2 of the main paper.

Qualitative video retrieval experiment. See https://www.youtube.com/watch?v=fuz-UBvgapk for a video example.

Discussion on ambiguity of annotations and our metrics.

Histogram showing the moments annotators mark in our dataset.

Example video showing our annotation tool (see https://www.youtube.com/watch?v=vAvT5Amp408 and https://www.youtube.com/watch?v=9WWgndeEjMU).

Appendix A Impact of Global Video Features and TEF Features

In the main paper we quantitatively show that global video features and tef features improve model performance. Here, we highlight qualitative examples where the global video features and tef features lead to better localization.

Figure 5 shows examples in which including global context improves performance. Examples like “The car passes the closest to the camera” require context to identify the correct moment. This is sensible as the word “closest” is comparative in nature and determining when the car is closest requires viewing the entire video. Other moments which are correctly localized with context include “we first see the second baby” and “the dog reaches the top of the stairs”.

Figure 6 shows examples in which including temporal endpoint features (tef) correctly localizes a video moment. For moments like “we first see the people” the model without tef retrieves a video moment with people, but fails to retrieve the moment when the people first appear. Without the tef, the model has no indication of when a moment occurs in a video. Thus, though the model can identify if there are people in a moment, the model is unable to determine when the people first appear. Likewise, for moments like “train begins to move”, the model without tef retrieves a video moment in which the train is moving, but not a moment in which the train begins to move.

Appendix B RGB and Flow Input Modalities

In the main paper, we demonstrate that RGB and optical flow inputs are complementary. Here we show a few examples which illustrate how RGB and flow input modalities complement each other. Figure 7 compares a model trained with RGB input and a model trained with optical flow input (both trained with global video features and tef). We expect the model trained with RGB to accurately localize moments which require understanding the appearance of objects and people in a scene, such as “child jumps into arms of man wearing yellow shirt” (Figure 7 top row). We expect the model trained with flow to better localize moments which require understanding of motion (including camera motion) such as “a dog looks at the camera and jumps at it” and “camera zooms in on a man playing the drums” (Figure 7 row 3 and 4). Frequently, both RGB and optical flow networks can correctly localize a moment (Figure 7 bottom row). However, for best results we take advantage of the complementary nature of RGB and optical flow input modalities in our fusion model.

Appendix C Qualitative Results for MCN

Figure 8 shows four videos in which we evaluate with fine-grained temporal windows at test time. Observing the plots in Figure 8 provides insight into the exact point at which a moment occurs. For example, our model correctly localizes the phrase “the blue trashcan goes out of view” (Figure 8 bottom right). The fine-grained temporal segments that align best with this phrase occur towards the end of the third segment (approximately 14s). Furthermore, Figure 8 provides insight into which parts of the video are most similar to the text query, and which parts are most dissimilar. For example, for the phrase “the blue trashcan goes out of view”, there are two peaks; the higher peak occurs when the blue trashcan goes out of view, and the other peak occurs when the blue trashcan comes back into view.

In the main paper, running a natural language object retrieval (NLOR) model on our data is a strong baseline. We expect this model to perform well on examples which require recognizing a specific object such as “a man in a brown shirt runs by the camera” (Figure 9 top row), but not as well for queries which require better understanding of action or camera movement such as “man runs towards camera with baby” (row 2 and 4 in Figure 9). Though the Moment Context Network performs well on DiDeMo, there are a variety of difficult queries it fails to properly localize, such as “Mother holds up the green board for the third time” (Figure 9 last row).

Please see https://www.youtube.com/watch?v=MRO7_4ouNWU for examples of moments correctly retrieved by our model.

Appendix D Additional Baselines

In the main paper we compare MCN to the natural language object retrieval model of . Since the publication of , better natural language object retrieval models have been proposed (e.g., ). We evaluate on our data, in a similar way to how we evaluated on our data in the main paper (Table 3 Row 5 in the main paper). We extract frames at 10 fps on videos in our test set and use to score each bounding box in an image for our description. The score for a frame is the max score of all bounding boxes in the frame, and the score for a moment is the average of the scores of all frames in the moment. We expect this model to do well when the moment descriptions can be well localized by localizing specific objects. Surprisingly, even though CMN outperforms for natural language object retrieval, it does worse than on our data (Table D row 6). One possible reason is that relies on parsing subject, relationship, and object triplets in sentences. Sentences in DiDeMo may not fit this structure well, leading to a decrease in performance. Additionally, is trained on MSCOCO and is trained on ReferIt . Though MSCOCO is larger than ReferIt, it is possible that the images in ReferIt are more similar to ours and thus a model trained on ReferIt transfers better to our task.
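The aggregation used for this baseline, a max over bounding boxes followed by a mean over frames, can be sketched as below. The box scores are made-up numbers standing in for the output of an object retrieval model, and the function name is hypothetical.

```python
def moment_score(box_scores_per_frame):
    """Frame score = max over boxes; moment score = mean over frames."""
    frame_scores = [max(boxes) for boxes in box_scores_per_frame]
    return sum(frame_scores) / len(frame_scores)


# Three frames of one candidate moment, each with its own set of box scores.
score = moment_score([[0.1, 0.7], [0.4, 0.2], [0.9]])
print(score)
```

The candidate moment with the highest aggregated score would then be returned for the query.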

Additionally, we train , which is designed for natural language image retrieval, using our data. relies on first running a dependency parser to extract sentence fragments linked in a dependency tree (e.g., “black dog”, or “run fast”). It scores an image based on how well sentence fragments match a set of proposed bounding boxes. To train this model for our task, we also extract sentence fragments, but then score temporal regions based on how well sentence fragments match a ground truth temporal region. We train on our data (using a late fusion approach to combine RGB and optical flow), and find that this baseline performs similarly to other baselines (Table D row 8). In general, we believe our method works better than other baselines because it considers both positive and negative moments when learning to localize video moments and directly optimizes the R@1 metric.