VideoGraph: Recognizing Minutes-Long Human Activities in Videos

Noureldien Hussein, Efstratios Gavves, Arnold W. M. Smeulders

Introduction

Human activities in videos can take many minutes to unfold, and each is packed with plenty of fine-grained visual details. Take for example two activities: “making pancake” or “preparing scrambled eggs”. What makes the difference between these two activities? Is it the fine-grained details in each, or the overall picture painted by each? Or both?

The goal of this paper is to recognize minutes-long human activities as defined by , also referred to as complex actions in . A long-range activity consists of a set of unit-actions , also known as one-actions . For example, the activity of “making pancakes” includes the unit-actions “cracking egg”, “pour milk” and “fry pancake”. Some of these unit-actions are crucial to distinguish the activity. For example, the unit-action “cracking egg” is all that is needed to discriminate the activity of “making pancakes” from “preparing coffee”. At the same time, a long-range activity is recognized only in its entirety, as its unit-actions are insufficient by themselves. For example, a short video snippet of the unit-action “cracking egg” alone cannot tell apart “making pancake” from “preparing scrambled eggs”, as both activities share that unit-action. Added to this, the temporal order of unit-actions for a specific activity may be permuted. There exist different orders in which one can carry out an activity like “prepare coffee”, see figure 1. Nonetheless, there exists some sort of temporal structure for such an activity. One can start “preparing coffee” by “taking cup” and usually ends with “pour sugar” and “stir coffee”. So, to recognize long-range human activities, two goals have to be met: modeling the temporal structure of the activity in its entirety, and occasionally paying attention to its fine-grained details.

There exist two distinct approaches for long-range temporal modeling. The first approach is orderless modeling. Statistical pooling and vector encoding [4; 5] are used to aggregate video information over time. The upside is the ability to address minutes- or even hours-long videos. The downside, however, is the inability to learn temporal patterns and the arrow-of-time. Both are proven to be crucial for some tasks [7; 8]. The second approach is order-aware modeling. 3D CNNs are proven to be successful in learning spatiotemporal concepts for short video snippets with a strict temporal pattern. Careful design choices enable them to model up to minute-long temporal dependencies. But for minutes-long human activities, the strict temporal pattern no longer exists. So, the question arises: how to model the temporal structure of minutes- or even hour-long human activities?

This paper proposes VideoGraph, a graph-inspired representation to achieve the aforementioned goal. A soft version of an undirected graph is learned completely from the dataset. The graph nodes represent the key latent concepts of which the human activity is composed. These latent concepts are analogous to one-actions. The graph edges, in turn, represent the temporal relationships between these latent concepts, i.e. the graph nodes. VideoGraph has the following novelties. i. In its graph-inspired representation, VideoGraph models human activity for up to thirty-minute videos, whereas the state-of-the-art is one minute. ii. A proposed node embedding block to learn the graph nodes from data. This circumvents the node annotation burden for long-range videos, and makes VideoGraph extensible to video datasets without node-level annotation. iii. A novel graph embedding layer to learn the relationships between graph nodes. The outcome is a representation of the temporal structure of long-range human activities. The result is improvements on benchmarks for human activities: Breakfast, Epic-Kitchens and Charades.

Related Work

Orderless vs. Order-aware Temporal Modeling. Be it short-, mid-, or long-range human activities, when it comes to temporal modeling, related methods are divided into two main families: orderless and order-aware. In orderless methods, the main focus is the statistical pooling of temporal signals in videos, without considering their temporal order or structure. Different pooling strategies are used, such as max and average pooling, attention pooling, and context gating, to name a few. A similar approach is vector aggregation, for example Fisher Vectors and VLAD [4; 5]. Although statistical pooling can trivially scale up to extremely long sequences in theory, this comes at the cost of losing the temporal structure, reminiscent of Bag-of-Words losing spatial understanding.

In order-aware methods, the main attention is paid to learning structured or ordered temporal patterns in videos. Examples are LSTMs [15; 16], CRF, and 3D CNNs [18; 9; 19; 20; 21]. Others propose temporal modeling layers on top of backbone CNNs, as in Temporal-Segments, Temporal-Relations and Rank-Pool. The outcome of order-aware methods is substantial improvements over their orderless counterparts in standard benchmarks [25; 26; 27]. Nevertheless, both temporal footprint and computational cost remain the main bottlenecks to learning long-range temporal dependencies. The best methods [2; 21] can model as much as 1k frames ( $\sim$ 30 seconds), which is no match for minutes-long videos. This paper strives for the best of two worlds: learning the temporal structure of human activities in minutes-long videos.

Short-range Actions vs. Long-range Activities.

A huge body of work is dedicated to recognizing human actions that take a few seconds to unfold. Examples of well-established benchmarks are Kinetics, Sports-1M, YouTube-8M, Moments in Time, 20B-Something and AVA. For these short- or mid-range actions, demonstrates that a few frames suffice for successful recognition. Other strands of work shift their attention to human activities that take minutes or even an hour to unfold. Cooking-related activities are good examples, as in YouCook, Breakfast, Epic-Kitchens, MPII Cooking or 50-Salads. Other examples include instructional videos: Charades, and unscripted activities: EventNet, Multi-THUMOS.

In all cases, several works [1; 2; 34; 38] define the differences between short- and long-range human actions, albeit with different naming. We follow the same definition of . More formally, we use unit-actions to refer to fine-grained, short-range human actions, and activities to refer to long-range complex human activities.

Graph-based Representation. Earlier, graph-based representations were used in storytelling [39; 40] and video retrieval. Different works use graph convolutions to learn concepts and/or relationships from data [42; 43; 44]. Recently, graph convolutions have been applied to image understanding, video understanding [46; 47; 48; 49] and question answering. Despite their success in learning structured representations from video datasets, the main limitation of graph convolution methods is that they require the graph nodes and/or edges to be known a priori. Consequently, when node- or frame-level annotations are not available, using these methods is hard. In contrast, this paper aims for a graph-inspired representation in which the graph nodes are fully inferred from data. The result is that our method is extensible to datasets without node-level annotations.

Self-Attention is used extensively in language understanding. The recently proposed transformer block shows substantial improvements in machine translation, image recognition, video understanding [48; 53] and even graph representations. The transformer block attends to a local feature conditioned on both local and global context. That is why it outperforms the self-attention mechanism [55; 56; 57], which is conditioned on only the local feature.

A video of human activity consists of short snippets of unit-actions. This paper is inspired by all these attention mechanisms to attend to a unit-action (i.e. local feature) based on the surrounding activity (i.e. global context).

Method

Motivation. We observe that a minutes-long and complex human activity is usually sub-divided into unit-actions. A similar understanding is reached by [1; 2], see Fig. 1. So, one can learn the temporal dependencies between these unit-actions using methods for sequence modeling in videos, such as LSTMs or 3D CNNs. However, these methods face the following limitations. First, such activities may take several minutes or even hours to unfold. Second, as video instances of the same activity are usually wildly different, there is no single temporal sequence that these methods can learn. For example, one can “prepare coffee” in many different ways, as the various paths in Fig. 1 indicate. Nevertheless, there seems to be an over-arching weak temporal structure of unit-actions when making a coffee.

We are inspired by graphs to represent the temporal structure of the human activities in videos. The upside is the ability of a graph-based representation to span minutes- or even hour-long temporal sequence of unit-actions while preserving their temporal relationships. The proposed method, VideoGraph, is depicted in Fig. 2, and in the following, we discuss its details.

In sum, the node attention block takes a feature $x_{i}$ , corresponding to a short video segment $s_{i}$ , and measures its similarity $\bm{\alpha}$ to a learned set of latent concepts $\hat{Y}$ . The similarities $\bm{\alpha}$ are then used to attend to the latent concepts. This is crucial for recognizing long-range videos, where the network is fed not only a short video segment $x_{i}$ but also the global representation $Y$ . This gives the network the ability to focus on both the local video signal $x_{i}$ and the global learned context $\hat{Y}$ .
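The data flow just summarized can be sketched in a few lines of NumPy. This is a minimal illustration rather than the actual implementation: the dot-product similarity, the sigmoid activation, the omission of the spatial dimensions, and the names `node_attention`, `W` and `b` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def node_attention(x_i, Y, W, b):
    """Attend to latent concepts given one segment feature.

    x_i: (C,)   feature of one video segment s_i (spatial dims omitted)
    Y:   (N, C) initial latent concepts
    W:   (C, C) and b: (C,) weights of a single fully-connected layer
    Returns the similarities alpha (N,) and the attended nodes Z_i (N, C).
    """
    Y_hat = Y @ W + b               # learned latent concepts
    alpha = sigmoid(Y_hat @ x_i)    # assumed similarity: dot product + sigmoid
    Z_i = alpha[:, None] * Y_hat    # scale each concept by its scalar similarity
    return alpha, Z_i

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, N = 8, 4
x_i = rng.normal(size=C)
Y = rng.normal(size=(N, C))
W = rng.normal(size=(C, C)) * 0.1
b = np.zeros(C)

alpha, Z_i = node_attention(x_i, Y, W, b)
print(alpha.shape, Z_i.shape)
```

Note that the output keeps one row per latent concept, so the node dimension $N$ is preserved for each segment.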

Our node attention block differs from the non-local counterpart in two ways. First, the attention values are conditioned on both the local $x_{i}$ and the global $\hat{Y}$ signals. Second, non-local computes a tensor product between the attention values $\bm{\alpha}$ and the local signal $x_{i}$ , while we attend by scalar multiplication between $\bm{\alpha}$ and $\hat{Y}$ to retain the node dimension. In addition, our node attention block is much simpler than the non-local block, as we use only one fully-connected layer.

Having learned the graph edges using convolutional operations, we proceed with BatchNormalization and a ReLU non-linearity. Finally, we downsample the entire graph representation $\mathbf{Z}$ over both the time and node dimensions using a MaxPooling operation. It uses kernel size $3$ and stride $3$ for both the time and node dimensions. Thus, after one layer of graph embedding, the resulting graph representation is reduced from $T \times N \times H \times W \times C$ to $(T/3) \times (N/3) \times H \times W \times C$ .
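The shape reduction of this pooling step can be checked with a small NumPy sketch. It assumes non-overlapping windows with no padding, so trailing timesteps or nodes that do not fill a window are dropped; the function name `max_pool_time_node` and the toy tensor sizes are hypothetical.

```python
import numpy as np

def max_pool_time_node(z, k=3):
    """Max-pool a (T, N, H, W, C) tensor over time and nodes, kernel = stride = k."""
    T, N, H, W, C = z.shape
    T_out, N_out = T // k, N // k
    # Drop the remainder so both axes split evenly into windows of size k.
    z = z[:T_out * k, :N_out * k]
    # Expose each window as its own axis, then reduce over the two window axes.
    z = z.reshape(T_out, k, N_out, k, H, W, C)
    return z.max(axis=(1, 3))

rng = np.random.default_rng(0)
Z = rng.normal(size=(64, 128, 2, 2, 4))   # T=64 timesteps, N=128 nodes (toy sizes)
Z_pooled = max_pool_time_node(Z)
print(Z_pooled.shape)
```

The spatial and channel dimensions are untouched; only the time and node axes shrink by roughly a factor of three.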

Experiments

VideoGraph is trained with batch-size 32 for 500 epochs. It is optimized with SGD with 0.1, 0.9 and 0.00001 as learning rate, momentum and weight decay, respectively. It is implemented using TensorFlow and Keras.
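For reference, the stated hyper-parameters can be written out as a single SGD update in plain NumPy. This is only a sketch: it assumes classical (non-Nesterov) momentum and treats weight decay as an L2 penalty added to the gradient, neither of which is specified above, and `sgd_step` is a hypothetical helper rather than a TensorFlow or Keras call.

```python
import numpy as np

# Hyper-parameters as stated in the text.
BATCH_SIZE = 32
EPOCHS = 500
LEARNING_RATE = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 0.00001

def sgd_step(w, grad, velocity):
    """One SGD update with momentum; weight decay is assumed to act as an L2 term."""
    grad = grad + WEIGHT_DECAY * w
    velocity = MOMENTUM * velocity - LEARNING_RATE * grad
    return w + velocity, velocity

# Toy problem: minimise 0.5 * ||w||^2, whose gradient is w itself.
w = np.ones(3)
v = np.zeros(3)
for _ in range(200):
    w, v = sgd_step(w, w, v)
print(np.linalg.norm(w))
```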

As this paper focuses on human activities spanning many minutes, we choose to conduct our experiments on the following benchmarks: Breakfast, Epic-Kitchens and Charades. Other benchmarks for human activities contain short-range videos, i.e. a minute or less, and thus do not fall within the scope of this paper.

Breakfast is a dataset for task-oriented human activities, with a focus on cooking. It is a video classification task of 12 categories of breakfast activities. It contains 1712 videos in total, 1357 for training and 335 for test. The average length of videos is 2.3 minutes. The activities are performed by 52 actors, 44 for training and 8 for test. Having different actors for the training and test splits is a realistic setup for testing generalization. Each video represents only one activity category. Besides, each video has temporal annotation of the unit-actions comprising the activity. In total, there are 48 classes of unit-actions. In our experiments, we only use the activity annotation, and we do not use the temporal annotation of unit-actions.

Epic-Kitchens is a recently introduced large-scale dataset for cooking activities. In total, it contains 274 videos performed by 28 actors in different kitchen setups. Each video represents a different cooking activity. The average length of videos is 30 minutes, which makes it ideal for experimenting with very long-range temporal modeling. Originally, the task proposed by the dataset is classification of short video snippets, with an average length of $\sim$ 3.7 seconds. The provided labels are, therefore, the categories of objects, verbs and unit-actions in each video snippet. However, the dataset does not provide a video-level category. That is why we consider all the object labels of a specific video as its video-level label, thus posing the problem as multi-label classification of these videos. This setup is exactly the same as that used in Charades for video classification. For performance evaluation, we use mean Average Precision (mAP), implemented in scikit-learn.
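The label construction and the metric can be illustrated with a short NumPy sketch. It assumes the snippet-level object labels are integer class ids, and it hand-rolls average precision (mean of the precision at each positive's rank) instead of calling scikit-learn, whose `average_precision_score` is the usual library route. All function names here are hypothetical.

```python
import numpy as np

def video_level_labels(snippet_object_ids, num_classes):
    """Multi-hot vector: 1 for every object class seen in any snippet of the video."""
    ids = np.asarray(snippet_object_ids, dtype=np.int64)
    y = np.zeros(num_classes, dtype=np.int64)
    y[np.unique(ids)] = 1
    return y

def average_precision(y_true, y_score):
    """AP for one class: mean precision at the rank of each positive video."""
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order]
    if hits.sum() == 0:
        return float("nan")             # class absent from the ground truth
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(Y_true, Y_score):
    """Y_true, Y_score: (num_videos, num_classes); classes without positives are skipped."""
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))

# One video whose snippets were labelled with object ids 2, 2 and 0.
print(video_level_labels([2, 2, 0], num_classes=4))
```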

Charades is a multi-label action classification video dataset with 157 classes. It consists of 8k, 1.2k and 2k videos for training, validation and testing, respectively. Each video spans 30 seconds and comprises 6 unit-actions, on average. This is why we choose Charades, as it fits the needs of this paper. For evaluation, we use mAP, as detailed in .

2 Experiments on Benchmarks

In this section, we evaluate VideoGraph on the benchmark datasets Breakfast, Charades and Epic-Kitchens, and we compare against related works. We choose two strong methods to compare against. The first is Timeception. The reason is that it can model 1k timesteps, which is up to a minute-long video. Another reason is that Timeception is an order-aware temporal method. The second related work is ActionVLAD. The reason is that it is a strong example of an orderless method. It can also aggregate the temporal signal for very long videos.

VideoGraph resides on top of a backbone CNN, be it a spatial 2D CNN or a spatio-temporal 3D CNN. So, in our comparison, we use two backbone CNNs, namely ResNet-152 and I3D. By default, I3D is designed to model a short video segment of 8 frames. But thanks to its fully-convolutional architecture, I3D can indeed process minutes-long videos. This is made possible by average pooling the features of many video snippets in the logit layer, i.e. before the softmax activation. ResNet-152 is a frame-level classifier. To extend it to video classification, we follow the same approach used in I3D and average pool the logits, i.e. before softmax. In all the following comparisons, we use 512 frames, or 64 segments, per video as input to I3D. And we use 64 frames per video as input to ResNet-152.
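The baseline aggregation described here, averaging the per-segment logits and only then applying softmax, can be sketched as follows. The random logits stand in for real backbone outputs, the 12 classes follow the Breakfast setup, and `video_prediction` is a hypothetical name.

```python
import numpy as np

FRAMES_PER_VIDEO = 512
FRAMES_PER_SEGMENT = 8
NUM_SEGMENTS = FRAMES_PER_VIDEO // FRAMES_PER_SEGMENT   # 64 segments per video

def softmax(logits):
    shifted = logits - logits.max()      # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def video_prediction(segment_logits):
    """segment_logits: (num_segments, num_classes) -> class probabilities for the video."""
    video_logits = segment_logits.mean(axis=0)   # average pool before softmax
    return softmax(video_logits)

rng = np.random.default_rng(0)
segment_logits = rng.normal(size=(NUM_SEGMENTS, 12))   # placeholder backbone outputs
probs = video_prediction(segment_logits)
print(probs.shape, probs.sum())
```

Because the average ignores the order of the segments, this baseline is itself an orderless aggregation.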

Breakfast. Each video in this dataset depicts a complex breakfast activity. Thus, the task at hand is single-label classification. The evaluation metric used is the classification accuracy. We evaluate our model on Breakfast, and we compare against baseline methods. The results are reported in table 2.

When comparing VideoGraph against related works, see table 2, we notice that VideoGraph is on par with Timeception. VideoGraph performs better when trained on a single-label video dataset, where each video has one label. This gives VideoGraph ample opportunity to tailor the graph-inspired representation for each class. However, as mentioned, we pose the task in Epic-Kitchens as multi-label classification. That is, there is no single category for a video. In that setting, VideoGraph does not perform as well.

Charades. In this experiment, we evaluate our model on the Charades dataset, and we compare the performance against recent works. The results are reported in Table 1. VideoGraph improves the performance of the backbone CNN. For VideoGraph, Charades is a particularly challenging dataset, for two reasons. First, the average video length is 30 seconds, whereas VideoGraph learns better representations for long-range videos. Second, it is a multi-label classification task, in which VideoGraph is not able to learn a unique category-specific graph.

3 Learned Graph Nodes

The proposed node attention block, see figure 4a, learns the latent concept representation $\hat{Y}$ using a fully-connected layer. This learning is conditioned on the initial value $Y$ . We found that this initial value is crucial for VideoGraph to converge. We experiment with 3 different types of initialization: i. random values, ii. Sobol sequence and iii. k-means centroids. Random values seem to be a natural choice, as all the learned weights in the model are randomly initialized before training. A Sobol sequence is a plausible choice, as the sequence guarantees low discrepancies between the initial values. The last choice has proven to be successful in ActionVLAD. The centroids are obtained by clustering the feature maps of the last convolutional layer of the backbone CNN. However, we do not find one winning strategy across the benchmarks used. We find that the Sobol sequence is the best choice for training on Epic-Kitchens and Charades, while random initialization gives the best results on Breakfast. In table 1, we report the performance of VideoGraph using different initialization choices for the latent concepts $Y$ . In all cases, we see in figure 5 that the node attention layer successfully learns discriminant representations of latent concepts as the training proceeds. In other words, the network learns to increase the Euclidean distance between each pair of latent concepts. This is further demonstrated in figure 6.
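Two ingredients of this analysis lend themselves to a small NumPy sketch: initializing the latent concepts with k-means centroids, and measuring how far apart the concepts are. The plain Lloyd iterations, the synthetic features, and the helper names `kmeans_centroids` and `mean_pairwise_distance` are assumptions for illustration only; a Sobol initialization would typically come from a quasi-random generator such as `scipy.stats.qmc.Sobol`, which is not used here.

```python
import numpy as np

def kmeans_centroids(features, n_nodes, n_iters=10, seed=0):
    """Plain Lloyd's k-means; returns (n_nodes, C) centroids to initialise Y."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=n_nodes, replace=False)
    centroids = features[start].astype(float)
    for _ in range(n_iters):
        # Distance of every feature to every centroid, shape (num_features, n_nodes).
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(n_nodes):
            members = features[assign == k]
            if len(members) > 0:         # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def mean_pairwise_distance(Y):
    """Average Euclidean distance over all distinct pairs of latent concepts."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    upper = np.triu_indices(len(Y), k=1)
    return float(d[upper].mean())

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))        # stand-in for backbone feature vectors
Y_random = rng.normal(size=(8, 16))          # i. random initialisation
Y_kmeans = kmeans_centroids(features, 8)     # iii. k-means centroid initialisation
print(mean_pairwise_distance(Y_random), mean_pairwise_distance(Y_kmeans))
```

Tracking `mean_pairwise_distance` on $\hat{Y}$ during training would be one simple way to quantify whether the latent concepts grow more distinct.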

4 Learned Graph Edges

Importance of Temporal Structure. In this experiment, we validate how much VideoGraph depends on the temporal structure and weak temporal order to recognize human activities. To this end, we choose Breakfast, as it is a temporally well-structured dataset. VideoGraph is trained on an ordered set of $64$ timesteps. We alter the temporal order of these timesteps and test the performance of VideoGraph. We use two alterations: i. random order, and ii. reversed order. Then, we measure the performance of VideoGraph, as well as the baselines, on the Breakfast test set.
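The two alterations are straightforward to express on a sequence of per-timestep features. The sketch below uses a toy feature matrix and a hypothetical `alter_order` helper; the plain average at the end is only a stand-in for an orderless aggregate, included to show that such an aggregate cannot distinguish the three orderings.

```python
import numpy as np

def alter_order(segments, mode, seed=0):
    """Return the (T, C) segment features in natural, reversed or random order."""
    if mode == "natural":
        return segments
    if mode == "reversed":
        return segments[::-1]
    if mode == "random":
        rng = np.random.default_rng(seed)
        return segments[rng.permutation(len(segments))]
    raise ValueError(f"unknown mode: {mode}")

T, C = 64, 4
segments = np.arange(T * C, dtype=float).reshape(T, C)   # toy per-timestep features

natural = alter_order(segments, "natural")
reversed_order = alter_order(segments, "reversed")
random_order = alter_order(segments, "random")

# An orderless aggregate is identical under every ordering.
print(np.allclose(natural.mean(axis=0), random_order.mean(axis=0)))
```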

We notice, from table 4, a huge drop in performance for both VideoGraph and Timeception. However, as expected, there is no drop in performance for ActionVLAD, as it is a completely orderless model. The conclusion is that VideoGraph encodes the temporal structure of the human activities in Breakfast. Added to this, it suffers a slightly smaller drop in performance than Timeception. More importantly, figure 8 shows the confusion matrix of classifying the videos of Breakfast in two cases: i. natural order of the temporal video segments, and ii. random order of the video segments. We notice that VideoGraph makes more mistakes with the random order. It mistakes “scrambled egg” for “fried egg” if the temporal order is neglected.

Conclusion

To successfully recognize minutes-long human activities such as “preparing breakfast” or “cleaning the house”, we argued that a successful solution needs to capture both the whole picture and the fine-grained details. To this end, we proposed VideoGraph, a graph-inspired representation to model the temporal structure of such long-range human activities. Firstly, thanks to the node attention layer, VideoGraph can learn the graph nodes. This alleviates the need for node-level annotation, which is prohibitively expensive in today's video datasets. Secondly, we proposed a graph embedding layer. It learns the relationships between graph nodes and how these nodes transition over time. Also, it compresses the graph representation to be fed to a classifier. We demonstrated the effectiveness of VideoGraph on three benchmarks: Breakfast, Epic-Kitchens and Charades. VideoGraph achieves good performance on all three of them. We also discussed some of the upsides and downsides of VideoGraph.