Dense-Captioning Events in Videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles

Introduction

With the introduction of large scale activity datasets, it has become possible to categorize videos into a discrete set of action categories. For example, in Figure 1, such models would output labels like playing piano or dancing. While the success of these methods is encouraging, they all share one key limitation: detail. To alleviate the lack of detail from existing action detection models, subsequent work has explored explaining video semantics using sentence descriptions. For example, in Figure 1, such models would likely concentrate on an elderly man playing the piano in front of a crowd. While this caption provides us more details about who is playing the piano and mentions an audience, it fails to recognize and articulate all the other events in the video. For example, at some point in the video, a woman starts singing along with the pianist and then later another man starts dancing to the music. In order to identify all the events in a video and describe them in natural language, we introduce the task of dense-captioning events, which requires a model to generate a set of descriptions for multiple events occurring in the video and localize them in time.

Dense-captioning events is analogous to dense-image-captioning; it describes videos and localizes events in time whereas dense-image-captioning describes and localizes regions in space. However, we observe that dense-captioning events comes with its own set of challenges distinct from the image case. One observation is that events in videos can range across multiple time scales and can even overlap. While piano recitals might last for the entire duration of a long video, the applause takes place in a couple of seconds. To capture all such events, we need to design ways of encoding short as well as long sequences of video frames to propose events. Past captioning works have circumvented this problem by encoding the entire video sequence by mean-pooling or by using a recurrent neural network (RNN). While this works well for short clips, encoding long video sequences that span minutes leads to vanishing gradients, preventing successful training. To overcome this limitation, we extend recent work on generating action proposals to multi-scale detection of events. Also, our proposal module processes each video in a forward pass, allowing us to detect events as they occur.

Another key observation is that the events in a given video are usually related to one another. In Figure 1, the crowd applauds because a man was playing the piano. Therefore, our model must be able to use context from surrounding events to caption each event. A recent paper has attempted to describe videos with multiple sentences. However, their model generates sentences for instructional “cooking” videos where the events occur sequentially and are highly correlated to the objects in the video. We show that their model does not generalize to “open” domain videos where events are action oriented and can even overlap. We introduce a captioning module that utilizes the context from all the events from our proposal module to generate each sentence. In addition, we show a variant of our captioning module that can operate on streaming videos by attending over only the past events. Our full model attends over both past as well as future events and demonstrates the importance of using context.

To evaluate our model and benchmark progress in dense-captioning events, we introduce the ActivityNet Captions dataset (available at http://cs.stanford.edu/people/ranjaykrishna/densevid/; for a detailed analysis of our dataset, please see our supplementary material). ActivityNet Captions contains 20k videos taken from ActivityNet, where each video is annotated with a series of temporally localized descriptions (Figure 1). To showcase long term event detection, our dataset contains videos as long as $10$ minutes, with each video annotated with on average 3.65 sentences. The descriptions refer to events that might be simultaneously occurring, causing the video segments to overlap. We ensure that each description in a given video is unique and refers to only one segment. While our videos are centered around human activities, the descriptions may also refer to non-human events such as: two hours later, the mixture becomes a delicious cake to eat. We collect our descriptions using crowdsourcing and find that there is high agreement in the temporal event segments, which is in line with research suggesting that brain activity is naturally structured into semantically meaningful events.

With ActivityNet Captions, we are able to provide the first results for the task of dense-captioning events. Together with our online proposal module and our online captioning module, we show that we can detect and describe events in long or even streaming videos. We demonstrate that we are able to detect events found in short clips as well as in long video sequences. Furthermore, we show that utilizing context from other events in the video improves dense-captioning events. Finally, we demonstrate how ActivityNet Captions can be used to study video retrieval as well as event localization.

Related work

Dense-captioning events bridges two separate bodies of work: temporal action proposals and video captioning. First, we review related work on action recognition, action detection and temporal proposals. Next, we survey how video captioning started from video retrieval and video summarization, leading to single-sentence captioning work. Finally, we contrast our work with recent work in captioning images and videos with multiple sentences.

Early work in activity recognition involved using hidden Markov models to learn latent action states, followed by discriminative SVM models that used key poses and action grammars. Similar works have used hand-crafted features or object-centric features to recognize actions in fixed camera settings. More recent works have used dense trajectories or deep learning features to study actions. While our work is similar to these methods, we focus on describing such events with natural language instead of a fixed label set.

To enable action localization, temporal action proposal methods started from traditional sliding window approaches and later started building models to propose a handful of possible action segments. These proposal methods have used dictionary learning or RNN architectures to find possible segments of interest. However, such methods required each video frame to be processed once for every sliding window. DAPs introduced a framework to allow proposing overlapping segments using a sliding window. We modify this framework by removing the sliding windows and outputting proposals at every time step in a single pass of the video. We further extend this model and enable it to detect long events by implementing a multi-scale version of DAPs, where we sample frames at longer strides.

Orthogonal to work studying proposals, early approaches that connected video with language studied the task of video retrieval with natural language. They worked on generating a common embedding space between language and videos. Similar to these, we evaluate how well existing models perform on our dataset. Additionally, we introduce the task of localizing a given sentence given a video frame, allowing us to now also evaluate whether our models are able to locate specified events.

In an effort to start describing videos, methods in video summarization aimed to congregate segments of videos that include important or interesting visual information. These methods attempted to use low level features such as color and motion or attempted to model objects and their relationships to select key segments. Meanwhile, others have utilized text inputs from user studies to guide the selection process. While these summaries provide a means of finding important segments, these methods are limited by small vocabularies and do not evaluate how well we can explain visual events.

After these summarization works, early attempts at video captioning simply mean-pooled video frame features and used a pipeline inspired by the success of image captioning. However, this approach only works for short video clips with only one major event. To avoid this issue, others have proposed either a recurrent encoder or an attention mechanism. To capture more detail in videos, a recent paper has recommended describing videos with paragraphs (a list of sentences) using a hierarchical RNN where the top level network generates a series of hidden vectors that are used to initialize low level RNNs that generate each individual sentence. While our paper is most similar to this work, we address two important missing factors. First, the sentences that their model generates refer to different events in the video but are not localized in time. Second, they use the TACoS-MultiLevel dataset, which contains fewer than $200$ videos, is constrained to “cooking” videos and only contains non-overlapping sequential events. We address these issues by introducing the ActivityNet Captions dataset which contains overlapping events and by introducing our captioning module that uses temporal context to capture the interdependency between all the events in a video.

Finally, we build upon the recent work on dense-image-captioning, which generates a set of localized descriptions for an image. Further work for this task has used spatial context to improve captioning. Inspired by this work, and by recent literature on using spatial attention to improve human tracking, we design our captioning module to incorporate temporal context (analogous to spatial context except in time) by attending over the other events in the video.

Dense-captioning events model

Our goal is to design an architecture that jointly localizes temporal proposals of interest and then describes each with natural language. The two main challenges we face are to develop a method that can (1) detect multiple events in short as well as long video sequences and (2) utilize the context from past, concurrent and future events to generate descriptions of each one. Our proposed architecture (Figure 2) draws on architectural elements present in recent work on action proposal and social human tracking to tackle both these challenges.

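Our model first sends the video frames through a proposal module that generates a set of proposals, each consisting of a temporal segment, a confidence $score_{i}$ and a hidden representation $h_{i}$.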

All the proposals with a $score_{i}$ higher than a threshold are forwarded to our language model that uses context from the other proposals while captioning each event. The hidden representation $h_{i}$ of the event proposal module is used as input to the captioning module, which then outputs descriptions for each event, while utilizing the context from the other events.
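The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Proposal` container, its field names and the threshold value of `0.5` are assumptions made for the example, since the text does not state them.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Proposal:
    # All field names are hypothetical; the text only names score_i and h_i.
    start: float             # segment start, in seconds
    end: float               # segment end, in seconds
    score: float             # confidence score_i from the proposal module
    hidden: Sequence[float]  # hidden representation h_i


def select_proposals(proposals: List[Proposal], threshold: float) -> List[Proposal]:
    """Keep only proposals whose score is strictly higher than the threshold."""
    return [p for p in proposals if p.score > threshold]


proposals = [
    Proposal(0.0, 60.0, 0.9, [0.1, 0.2]),
    Proposal(10.0, 12.0, 0.3, [0.0, 0.5]),
    Proposal(30.0, 45.0, 0.7, [0.4, 0.1]),
]
# The threshold 0.5 is an assumed value chosen for illustration.
kept = select_proposals(proposals, threshold=0.5)
```

Only the hidden representations of the surviving proposals would then be handed to the captioning module.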

1 Event proposal module

The proposal module in Figure 2 tackles the challenge of detecting events in short as well as long video sequences, while preventing the dense application of our language model over sliding windows during inference. Prior work usually pools video features globally into a fixed sized vector, which is sufficient for representing short video clips but is unable to detect multiple events in long videos. Additionally, we would like to detect events in a single pass of the video so that the gains over a simple temporal sliding window are significant. To tackle this challenge, we design an event proposal module to be a variant of DAPs that can detect longer events.

Input. Our proposal module receives a series of features capturing semantic information from the video frames. Concretely, the input to our proposal module is a sequence of features: $\{f_{t}=F(v_{t}:v_{t+\delta})\}$ where $\delta$ is the time resolution of each feature $f_{t}$. In our paper, $F$ extracts C3D features where $\delta=16$ frames. The output of $F$ is a tensor of size $N{\times}D$, where $D=500$ is the feature dimension and $N=T/\delta$ is the number of time steps obtained by discretizing the $T$ video frames.
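To make the discretization concrete, the sketch below computes $N=T/\delta$ and the frame window covered by each feature $f_{t}$. The function names are hypothetical, and dropping a trailing partial window (floor division) is an assumption; the text does not say how leftover frames are handled.

```python
from typing import List, Tuple


def num_feature_steps(num_frames: int, delta: int = 16) -> int:
    """N = T / delta, assuming a trailing partial window is dropped."""
    return num_frames // delta


def window_bounds(num_frames: int, delta: int = 16) -> List[Tuple[int, int]]:
    """Half-open frame ranges [start, end) covered by each feature f_t."""
    return [(t * delta, (t + 1) * delta) for t in range(num_frames // delta)]


# A hypothetical video of 5400 frames yields 337 feature steps at delta = 16.
steps = num_feature_steps(5400)
bounds = window_bounds(48)
```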

DAPs. Next, we feed these features into a variant of DAPs where we sample the video features at different strides ($1$, $2$, $4$ and $8$ for our experiments) and feed them into a proposal long short-term memory (LSTM) unit. The longer strides are able to capture longer events. The LSTM accumulates evidence across time as the video features progress. We do not modify the training of DAPs and only change the model at inference time by outputting $K$ proposals at every time step, each proposing an event with offsets. So, the LSTM is capable of generating proposals at different overlapping time intervals and we only need to iterate over the video once, since all the strides can be computed in parallel. Whenever the proposal LSTM detects an event, we use the hidden state of the LSTM at that time step as a feature representation of the visual event. Note that the proposal model can output proposals for events that can be overlapping. While traditional DAPs uses non-maximum suppression to eliminate overlapping outputs, we keep them separate and treat them as individual events.
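The multi-stride sampling can be illustrated with a short sketch. It only shows how one feature sequence would be subsampled into one stream per stride; the proposal LSTM itself is omitted, and the function name and the choice to start every stream at index 0 are assumptions.

```python
from typing import Dict, List, Sequence

STRIDES = (1, 2, 4, 8)  # strides reported in the text


def sample_at_strides(features: Sequence, strides: Sequence[int] = STRIDES) -> Dict[int, List]:
    """Return one subsampled feature sequence per stride.

    Each stream is independent of the others, so a proposal LSTM could
    consume all of them in parallel during a single pass over the video.
    """
    return {s: list(features[::s]) for s in strides}


# Integers stand in for the D-dimensional feature vectors f_t.
features = list(range(16))
streams = sample_at_strides(features)
```

With a stride of 8, the same number of LSTM steps spans eight times as many frames as with a stride of 1, which is why longer strides suit longer events.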

2 Captioning module with context

Once we have the event proposals, the next stage of our pipeline is responsible for describing each event. A naive captioning approach could treat each description individually and use a captioning LSTM network to describe each one. However, most events in a video are correlated and can even cause one another. For example, we saw in Figure 1 that the man playing the piano caused the other person to start dancing. We also saw that after the man finished playing the piano, the audience applauded. To capture such correlations, we design our captioning module to incorporate the “context” from its neighboring events. Inspired by recent work on human tracking that utilizes spatial context between neighboring tracks, we develop an analogous model that captures temporal context in videos by grouping together events in time instead of tracks in space.

Language modeling. Each language LSTM is initialized to have $2$ layers with $512$ dimensional hidden representation. We randomly initialize all the word vector embeddings from a Gaussian with standard deviation of $0.01$. We sample predictions from the model using beam search of size $5$.

3 Implementation details

We only pass to the language model proposals that have a high IoU with ground truth proposals. Similar to previous work on language modeling, we use a cross-entropy loss across all words in every sentence. We normalize the loss by the batch-size and sequence length in the language model. We weight the contribution of the captioning loss with $\lambda_{1}=1.0$ and the proposal loss with $\lambda_{2}=0.1$, so that the total loss is $\mathcal{L}=\lambda_{1}\mathcal{L}_{cap}+\lambda_{2}\mathcal{L}_{prop}$, where $\mathcal{L}_{cap}$ and $\mathcal{L}_{prop}$ denote the captioning and proposal losses.

Training and optimization. We train our full dense-captioning model by alternating between training the language model and the proposal module every $500$ iterations. We first train the captioning module by masking all neighboring events for $10$ epochs before adding in the context features. We initialize all weights using a Gaussian with standard deviation of $0.01$ . We use stochastic gradient descent with momentum $0.9$ to train. We use an initial learning rate of $1{\times}10^{-2}$ for the language model and $1{\times}10^{-3}$ for the proposal module. For efficiency, we do not finetune the C3D feature extraction.

Our training batch-size is set to $1$ . We cap all sentences to be a maximum sentence length of $30$ words and implement all our code in PyTorch 0.1.10. One mini-batch runs in approximately $15.84$ ms on a Titan X GPU and it takes 2 days for the model to converge.

ActivityNet Captions dataset

The ActivityNet Captions dataset connects videos to a series of temporally annotated sentences. Each sentence covers a unique segment of the video, describing an event that occurs. These events may occur over very long or short periods of time and are not limited in any capacity, allowing them to co-occur. We will now present an overview of the dataset and also provide a detailed analysis and comparison with other datasets in our supplementary material.

On average, each of the 20k videos in ActivityNet Captions contains 3.65 temporally localized sentences, resulting in a total of 100k sentences. We find that the number of sentences per video follows a relatively normal distribution. Furthermore, as the video duration increases, the number of sentences also increases. Each sentence has an average length of 13.48 words, which is also normally distributed.

On average, each sentence describes $36$ seconds and $31\%$ of its respective video. However, the entire paragraph for each video on average describes $94.6\%$ of the entire video, demonstrating that each paragraph annotation still covers all major actions within the video. Furthermore, we found that $10\%$ of the temporal descriptions overlap, showing that the annotations capture simultaneous events.

Finally, our analysis of the sentences themselves indicates that ActivityNet Captions focuses on verbs and actions. In Figure 3, we compare against Visual Genome, the image dataset with the largest number of image descriptions (~4.5 million). With the percentage of verbs in ActivityNet Captions being significantly higher, we find that ActivityNet Captions shifts sentence descriptions from being object-centric in images to action-centric in videos. Furthermore, as there exists a greater percentage of pronouns in ActivityNet Captions, we find that the sentence labels will more often refer to entities found in prior sentences.

2 Temporal agreement amongst annotators

To verify that ActivityNet Captions’s captions mark semantically meaningful events, we collected two distinct, temporally annotated paragraphs from different workers for each of the $4926$ validation and $5044$ test videos. Each pair of annotations was then tested to see how well they temporally corresponded to each other. We found that, on average, each sentence description had a tIoU of $70.2\%$ with the maximal overlapping combination of sentences from the other paragraph. Since these results agree with prior work, we found that workers generally agree with each other when annotating temporal boundaries of video events.
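For readers unfamiliar with tIoU (temporal intersection over union), the sketch below shows how it is computed for two segments and how an agreement score between two paragraphs could be averaged. It is a simplification under a stated assumption: each sentence is matched to its single best-overlapping sentence in the other paragraph, whereas the text matches against the maximal overlapping combination of sentences, which is not reproduced here. The function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_best_tiou(paragraph_a: List[Segment], paragraph_b: List[Segment]) -> float:
    """Average, over sentences in A, of the best tIoU with any sentence in B."""
    best = [max(tiou(a, b) for b in paragraph_b) for a in paragraph_a]
    return sum(best) / len(best)


worker_1 = [(0.0, 10.0), (10.0, 20.0)]
worker_2 = [(0.0, 10.0), (12.0, 20.0)]
agreement = mean_best_tiou(worker_1, worker_2)
```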

Experiments

We evaluate our model by detecting multiple events in videos and describing them. We refer to this task as dense-captioning events (Section 5.1). We test our model on ActivityNet Captions, which was built specifically for this task.

Next, we provide baseline results on two additional tasks that are possible with our model. The first of these tasks is localization (Section 5.2), which tests our proposal model’s capability to adequately localize all the events for a given video. The second task is retrieval (Section 5.3), which tests a variant of our model’s ability to recover the correct set of sentences given the video or vice versa. Both these tasks are designed to test the event proposal module (localization) and the captioning module (retrieval) individually.

To dense-caption events, our model is given an input video and is tasked with detecting individual events and describing each one with natural language.

Evaluation metrics. Inspired by the dense-image-captioning metric, we use a similar metric to measure the joint ability of our model to both localize and caption events. This metric computes the average precision across tIoU thresholds of $0.3$, $0.5$ and $0.7$ when captioning the top $1000$ proposals. We measure precision of our captions using traditional evaluation metrics: Bleu, METEOR and CIDEr.

To isolate the performance of language in the predicted captions without localization, we also use ground truth locations across each test video and evaluate predicted captions.

Baseline models. Since all the previous models proposed so far have focused on the task of describing entire videos and not detecting a series of events, we only compare existing video captioning models using ground truth proposals. Specifically, we compare our work with LSTM-YT, S2VT and H-RNN. LSTM-YT pools together video features to describe videos while S2VT encodes a video using an RNN. H-RNN generates paragraphs by using one RNN to caption individual sentences while the second RNN is used to sequentially initialize the hidden state for the next sentence generation. Our model can be thought of as a generalization of the H-RNN model as it uses context, not just from the previous sentence but from surrounding events in the video. Additionally, our method treats context, not as features from object detectors but encodes it from unique parts of the proposal module.

Variants of our model. Additionally, we compare different variants of our model. Our no context model is our implementation of S2VT. The full model is our complete model described in Section 3. The online model is a version of our full model that uses context only from past events and not from future events. This version of our model can be used to caption long streams of video in a single pass. The full $-$ attn and online $-$ attn models use mean pooling instead of attention to combine features, i.e. they set $w_{j}=1$ in Equation 5.
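The relationship between the attention variants and the $-$ attn variants can be sketched as a weighted pooling of the hidden representations of the other events. This is an illustrative sketch only: the function name is hypothetical, and normalizing by the number of events is an assumption, chosen so that setting every $w_{j}=1$ reduces exactly to mean pooling.

```python
from typing import List, Sequence


def pool_context(hiddens: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of event representations h_j, divided by the event count.

    With every weight equal to 1 this is plain mean pooling (the "- attn"
    variants); learned attention weights w_j would replace the ones.
    """
    if not hiddens:
        return []
    dim = len(hiddens[0])
    count = len(hiddens)
    return [sum(w * h[d] for w, h in zip(weights, hiddens)) / count for d in range(dim)]


hiddens = [[1.0, 2.0], [3.0, 4.0]]
mean_ctx = pool_context(hiddens, [1.0, 1.0])  # w_j = 1 for all events
attn_ctx = pool_context(hiddens, [2.0, 0.0])  # attention focused on the first event
```

For the online variant, `hiddens` would contain only past events; for the full variant it would also include future events.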

Captioning results. All previous work has focused on captioning complete videos. We find that LSTM-YT performs much worse than other models as it tries to encode long sequences of video by mean pooling their features (Table 1). H-RNN performs slightly better but attends over object level features to generate sentences, which causes it to only slightly outperform LSTM-YT, since we demonstrated earlier that the captions in our dataset are not object centric but action centric instead. S2VT and our no context model perform better than the previous baselines with a CIDEr score of $20.97$ as they use an RNN to encode the video features. We see an improvement in performance to $22.19$ and $22.94$ when we incorporate context from past events into our online $-$ attn and online models. Finally, when we also consider events that will happen in the future, we see further improvements to $24.24$ and $24.56$ for the full $-$ attn and full models. Note that while the improvements from using attention are not too large, we see greater improvements amongst videos with more events, suggesting that attention is useful for longer videos.

Sentence order. To further benchmark the improvements calculated from utilizing past and future context, we report results using ground truth proposals for the first three sentences in each video (Table 2). While there are videos with more than three sentences, we report results only for the first three because almost all the videos in the dataset contain at least three sentences. We notice that the online and full context models see most of their improvements from subsequent sentences, i.e. not the first sentence. In fact, we notice that after adding context, the CIDEr score for the online and full models tends to decrease for the $1^{st}$ sentence.

Results for dense-captioning events. When using proposals instead of ground truth events (Table 1), we see a similar trend where adding more context improves captioning. However, we also see that the improvements from attention are more pronounced since there are many events that the model has to caption. Attention allows the model to adequately focus in on select other events that are relevant to the current event. We show example qualitative results from the variants of our models in Figure 4. In (a), we see that the last caption in the no context model drifts off topic while the full model utilizes context to generate a more reasonable caption. In (c), we see that our full context model is able to use the knowledge that the vegetables are later mixed in the bowl to also mention the bowl in the third and fourth sentences, propagating context back through to past events. However, context is not always successful at generating better captions. In (c), when the proposed segments have a high overlap, our model fails to distinguish between the two events, causing it to repeat captions.

2 Event localization

One of the main goals of this paper is to develop models that can locate any given event within a video. Therefore, we test how well our model can predict the temporal location of events within the corresponding video, in isolation of the captioning module. Recall that our variant of the proposal module proposes events at different strides. Specifically, we test with strides of $1$, $2$, $4$ and $8$. Each stride can be computed in parallel, allowing the proposal module to run in a single pass.

Setup. We evaluate our proposal module using recall (like previous work) against (1) the number of proposals and (2) the IoU with ground truth events. Specifically, we are testing whether the use of different strides does in fact improve event localization.

Results. Figure 5 shows the recall of predicted localizations that overlap with ground truth over a range of IoUs from $0.0$ to $1.0$ and a number of proposals ranging up to $1000$. We find that using more strides improves recall across all values of IoU, with diminishing returns. We also observe that when proposing only a few proposals, the model with stride $1$ performs better than any of the multi-stride versions. This occurs because there are more training examples for smaller strides as these models have more video frames to iterate over, allowing them to be more accurate. So, when predicting only a few proposals, the model with stride $1$ localizes the most correct events. However, as we increase the number of proposals, we find that the proposal network with only a stride of $1$ plateaus around a recall of $0.3$, while our multi-scale models perform better.
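As a rough illustration of the recall measure used here, the sketch below counts a ground truth event as recalled when at least one proposal overlaps it with IoU at or above a threshold. The function names and the matching rule (any proposal may match, with no one-to-one assignment) are assumptions; the exact evaluation protocol is not spelled out in the text.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(ground_truth: List[Segment], proposals: List[Segment], threshold: float) -> float:
    """Fraction of ground truth events matched by at least one proposal."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for g in ground_truth
        if any(tiou(g, p) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


ground_truth = [(0.0, 10.0), (20.0, 30.0)]
proposals = [(0.0, 8.0), (40.0, 50.0)]
recall_loose = recall_at_tiou(ground_truth, proposals, threshold=0.5)
recall_strict = recall_at_tiou(ground_truth, proposals, threshold=0.9)
```

Sweeping `threshold` from 0.0 to 1.0, or truncating `proposals` to the top-scoring few, gives the two kinds of curve discussed above.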

3 Video and paragraph retrieval

While we introduce dense-captioning events, a new task to study video understanding, we also evaluate our intuition to use context on a more traditional task: video retrieval.

Setup. In video retrieval, we are given a set of sentences that describe different parts of a video and are asked to retrieve the correct video from the test set of all videos. Our retrieval model is a slight variant on our dense-captioning model where we encode all the sentences using our captioning module and then combine the context together for each sentence and match each sentence to multiple proposals from a video. We assume that we have ground truth proposals for each video and encode each proposal using the LSTM from our proposal model. We train our model using a max-margin loss that attempts to align the correct sentence encoding to its corresponding video proposal encoding. We also report how this model performs if the task is reversed, where we are given a video as input and are asked to retrieve the correct paragraph from the complete set of paragraphs in the test set.
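A max-margin alignment loss of the kind mentioned above can be sketched as a hinge ranking loss. This is one common formulation, not necessarily the one used by the authors: the dot-product similarity, the margin value of `0.1` and the function names are all assumptions.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two encodings (an assumed choice)."""
    return sum(x * y for x, y in zip(a, b))


def max_margin_loss(sent_encs, prop_encs, margin: float = 0.1) -> float:
    """Hinge ranking loss over aligned (sentence i, proposal i) pairs.

    Each sentence should score its own proposal at least `margin` higher
    than every other proposal; violations are summed.
    """
    loss = 0.0
    for i, s in enumerate(sent_encs):
        positive = dot(s, prop_encs[i])
        for j, v in enumerate(prop_encs):
            if j != i:
                loss += max(0.0, margin - positive + dot(s, v))
    return loss


sentences = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = max_margin_loss(sentences, aligned)
loss_swapped = max_margin_loss(sentences, swapped)
```

The loss is zero when every sentence already prefers its own proposal by the margin, and grows as mismatched pairs score higher than matched ones.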

Results. We report our results in Table 3. We evaluate retrieval using recall at various thresholds and the median rank. We use the same baseline models as our previous tasks. We find that models that use RNNs (no context) to encode the video proposals perform better than models that pool video features (LSTM-YT). We also see a direct increase in performance when context is used. Unlike dense-captioning, we do not see a marked increase in performance when we include context from future events as well. We find that our online model performs almost on par with our full model.
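The two retrieval metrics can be computed from the rank of the correct item for each query. The sketch below is illustrative: the function names are hypothetical, ties are broken in favor of the correct item, and the cutoff `k=5` is an assumed example value since the text does not list the thresholds.

```python
from statistics import median
from typing import List, Sequence


def rank_of_correct(scores: Sequence[float], correct_idx: int) -> int:
    """1-based rank of the correct candidate when sorted by descending score."""
    target = scores[correct_idx]
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(ranks: List[int], k: int) -> float:
    """Fraction of queries whose correct item appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Ranks of the correct video for four hypothetical paragraph queries.
ranks = [1, 3, 7, 60]
r_at_5 = recall_at_k(ranks, k=5)
med_rank = median(ranks)
```

Higher recall and a lower median rank both indicate better retrieval.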

Conclusion

We introduced the task of dense-captioning events and identified two challenges: (1) events can occur within a second or last up to minutes, and (2) events in a video are related to one another. To tackle both these challenges, we proposed a model that combines a new variant of an existing proposal module with a new captioning module. The proposal module samples video frames at different strides and gathers evidence to propose events at different time scales in one pass of the video. The captioning module attends over the neighboring events, utilizing their context to improve the generation of captions. We compare variants of our model and demonstrate that context does indeed improve captioning. We further show how the captioning model uses context to improve video retrieval and how our proposal model uses the different strides to improve event localization. Finally, this paper also releases a new dataset for dense-captioning events: ActivityNet Captions.

Supplementary material

In the supplementary material, we compare and contrast our dataset with other datasets and provide additional details about our dataset. We include screenshots of our collection interface with detailed instructions. We also provide additional details about the workers who completed our tasks.

Curation and open distribution of datasets are closely correlated with progress in the field of video understanding (Table 4). The KTH dataset pioneered the field by studying human actions with a black background. Since then, datasets like UCF101, Sports 1M and Thumos 15 have focused on studying actions in sports related internet videos while HMDB 51 and Hollywood 2 introduced datasets of movie clips. Recently, ActivityNet and Charades broadened the domain of activities captured by these datasets by including a large set of human activities. In an effort to map video semantics with language, MPII MD and M-VAD released short movie clips with descriptions. In an effort to capture longer events, MSR-VTT, MSVD and YouCook collected datasets with slightly longer videos, at the cost of fewer descriptions than previous datasets. To further improve video annotations, KITTI and TACoS also temporally localized their video descriptions. Orthogonally, in an effort to increase the complexity of descriptions, TACoS multi-level expanded the TACoS dataset to include paragraph descriptions of instructional cooking videos. However, their dataset is constrained to the “cooking” domain and contains on the order of $100$ videos, making it unsuitable for dense-captioning of events as the models easily overfit to the training data.

Our dataset, ActivityNet Captions, aims to bridge these three orthogonal approaches by temporally annotating long videos while also building upon the complexity of descriptions. ActivityNet Captions contains videos that are on average 180s long, with the longest video running to over 10 minutes. It contains a total of 100k sentences, where each sentence is temporally localized. Unlike TACoS multi-level, we have two orders of magnitude more videos and provide annotations for an open domain. Finally, we are also the first dataset to enable the study of concurrent events, by allowing our events to overlap.

2 Detailed dataset statistics

As noted in the main paper, the number of sentences accompanying each video is normally distributed, as seen in Figure 6. On average, each video contains $3.65\pm 1.79$ sentences. Similarly, the number of words in each sentence is normally distributed, as seen in Figure 7. On average, each sentence contains $13.48\pm 6.33$ words, and each video contains $40\pm 26$ words.

There exists interaction between the video content and the corresponding temporal annotations. In Figure 8, the number of sentences accompanying a video is shown to be positively correlated with the video’s length: each additional minute adds approximately $1$ additional sentence description. Furthermore, as seen in Figure 9, the sentence descriptions focus on the middle parts of the video more than the beginning or end.

When studying the distribution of words in Figures 10 and 11, we found that ActivityNet Captions generally focuses on people and the actions these people take. However, we wanted to know whether ActivityNet Captions captured the general semantics of the video. To do so, we compare our sentence descriptions against the shorter labels of ActivityNet, since ActivityNet Captions annotates ActivityNet videos. Figure 16 illustrates that the majority of videos in ActivityNet Captions contain ActivityNet’s labels in at least one of their sentence descriptions. We find that many entry-level categories such as brushing hair or playing violin are extremely well represented by our captions. However, as the categories become more nuanced, such as powerbocking or cumbia, they are not as commonly found in our descriptions.

3 Dataset collection process

We used Amazon Mechanical Turk to annotate all our videos. Each annotation task was divided into two steps: (1) writing a paragraph describing all major events happening in the video, with each sentence of the paragraph describing one event (Figure 12); and (2) labeling the start and end time in the video at which the event described by each sentence occurred (Figure 13). We find complementary evidence that workers are more consistent with their video segments and paragraph descriptions if they are asked to annotate visual media (in this case, videos) using natural language first. Therefore, instead of asking workers to segment the video first and then write individual sentences, we asked them to write paragraph descriptions first.

Workers were instructed to ensure that their paragraphs were at least $3$ sentences long, where each sentence describes events in the video and the sentences together form a grammatically and semantically coherent paragraph. They were allowed to use co-referencing words (e.g., he, she) to refer to subjects introduced in previous sentences. We also asked workers to write sentences that were at least $5$ words long. We found that our workers were diligent and wrote an average of 13.48 words per sentence.

Workers were presented with examples of good and bad annotations with explanations for what constituted a good paragraph, ensuring that workers saw concrete evidence of what kind of work was expected of them (Figure 14). We paid workers $\$3$ for every $5$ videos that were annotated. This amounted to an average pay rate of $\$8$ per hour, which is in line with fair crowd worker wage rates.

4 Annotation details

Following previous work showing that crowd workers are able to perform at the same quality of work when allowed to view media at a faster rate, we show all videos to workers at $2$X the speed, i.e. the videos are shown at twice the frame rate. Workers do, however, have the option of watching the videos at the original video speed or even speeding them up to $3$X or $4$X the speed. We found, however, that the average viewing rate chosen by workers was $1.91$X while the median rate was $1$X, indicating that a majority of workers preferred watching the video at its original speed. We also find that workers tend to take an average of $2.88$ and a median of $1.46$ times the length of the video to annotate.

At any given time, workers have the ability to edit their paragraph or go back to previous videos to make changes to their annotations. They are only allowed to proceed to the next video if the current video has been completely annotated with a paragraph with all its sentences timestamped. Changes made to the paragraphs and timestamps are saved when “previous video” or “next video” is pressed, and are reflected on the page. Only when all videos are annotated can the worker submit the task. In total, we had $112$ workers who annotated all our videos.

Acknowledgements. This research was sponsored in part by grants from the Office of Naval Research (N00014-15-1-2813) and Panasonic, Inc. We thank JunYoung Gwak, Timnit Gebru, Alvaro Soto, and Alexandre Alahi for their helpful comments and discussion.