Temporal Dynamic Graph LSTM for Action-driven Video Object Detection

Yuan Yuan, Xiaodan Liang, Xiaolong Wang, Dit-Yan Yeung, Abhinav Gupta

Introduction

With the recent success of data-driven approaches in recognition, there has been a growing interest in scaling up object detection systems. However, unlike classification, exhaustively annotating object instances with diverse classes and bounding boxes is hardly scalable. Therefore, there has been a surge of interest in unsupervised and weakly-supervised approaches for object detection. Fully unsupervised approaches without any annotations currently give considerably inferior performance on similar tasks, while conventional weakly-supervised methods use static images to learn the detectors. These object detectors, however, fail to generalize to videos due to the domain shift. One alternative is to apply these weakly-supervised approaches to the video frames themselves. However, current approaches rely heavily on the accuracy of image-level labels and are vulnerable to missing labels (as shown in Figure 1). Can we design a learning framework that is robust to these missing labels?

In this paper, we explore a novel slightly-supervised video object detection pipeline that uses human action labels as supervision for object detection. As illustrated in Figure 1, the coarse human action labels spanning multiple frames (e.g., watching a laptop or sitting in a chair) help indicate the presence of participating object instances (e.g., laptop and chair). Compared to prior works, our investigated setting has two major merits: 1) the textual action descriptions for videos are much cheaper to collect, e.g., through text tags, search queries and action recognition datasets; and 2) the intrinsic temporal coherence in the video domain provides more cues to facilitate the recognition of each object instance and helps overcome the missing label problem.

Action-driven supervision for object detection is much more challenging since it can only access object labels for some specific frames, while a considerable number of uninvolved object labels are unknown. As shown in the right column of Figure 1, four action categories are labeled for different periods in the given video. In each period, the action label (e.g., tidying a shelf) only points out the shelf category and misses the rest of the categories such as laptop, table, chair and refrigerator. On the other hand, the missed categories (e.g., laptop) may appear in other labeled actions in the same video. Inspired by this observation, we propose to alleviate the missing label issue by exploiting the rich temporal correlations of object instances in the video. The core idea is that action labels in a different period may help to infer the presence of some objects in the current period. Specifically, a novel temporal dynamic graph LSTM (TD-Graph LSTM) framework is introduced to model the complex and dynamic temporal graph structure for object proposals in the whole video and thus enable joint reasoning over all frames. The knowledge of all action labels in the video can thus be effectively transferred into all frames to enhance their frame-level categorizations.

To incorporate the temporal correlation of object proposals for global reasoning, we resort to the family of recurrent neural networks due to their good sequential modeling capability. However, existing recurrent networks are largely limited to constrained information propagation on fixed nodes following predefined routes, such as tree-LSTM, graph-LSTM and structural-RNN. In contrast, due to the unknown object localizations and temporal motion, it is difficult to find an optimal structure that connects object proposals for routed information propagation to achieve weakly-supervised video object detection. The proposed TD-Graph LSTM, posed as a general dynamic recurrent structure, overcomes these limitations by performing dynamic information propagation based on an adaptive temporal graph that varies over both time periods in the video and the model status in each updating step.

Specifically, the dynamic temporal graph is constructed based on the visual correlation of object proposals across neighboring frames. The set of graph nodes denotes the entire collection of object proposals in all the frames, while graph edges are adaptively specified for consecutive frames in distinct learning steps. At each iteration, given the updated feature representation of object proposals, we only activate the edge connections with the object proposals that have the highest similarities with each current proposal. The adaptive graph topology can thus be constructed where different proposals are connected with different temporally correlated neighbors. TD-Graph LSTM alternately performs the information propagation through each temporal graph topology and updates the graph topology at each iteration. In this way, our model enables the joint optimization of feature learning and temporal inference towards a robust slightly-supervised detection framework.

The contributions of this paper are summarized as follows: 1) We explore a new slightly-supervised video object detection pipeline that leverages convenient action descriptions as the supervision; 2) A novel TD-Graph LSTM framework alleviates the missing label issue by enabling global reasoning over the whole video; 3) TD-Graph LSTM is posed as a general dynamic recurrent structure that performs temporal information propagation on an adaptively updated graph topology at each iteration; 4) We collect and release 5,000 frame annotations with object-level bounding boxes on daily-life videos, with the goal of evaluating our model and also helping advance the object detection community.

Related Works

Weakly-Supervised Object Detection. Though recent state-of-the-art fully-supervised detection pipelines have achieved great progress, they heavily rely on large-scale bounding-box annotations. To alleviate this expensive annotation labor, weakly-supervised methods have recently attracted a lot of interest. These approaches use cheaper image-level object labels rather than bounding boxes. Beyond the image domain, another line of research attempts to exploit the temporal information embedded in videos to facilitate weakly-supervised object detection. Different from all the existing pipelines, we investigate a much cheaper action-driven object detection setting that aims to detect all object instances given only action descriptions. In addition, instead of employing multiple separate steps (e.g., detection and tracking) to capture motion patterns, our TD-Graph LSTM is an end-to-end framework that incorporates the intrinsic temporal coherence with a designed dynamic recurrent network structure into the action-driven slightly-supervised detection.

Sequential Modeling. Recurrent neural networks, especially Long Short-Term Memory (LSTM), have been adopted to address many video processing tasks such as action recognition, action detection, video prediction, and video summarization. However, limited by the fixed propagation route of existing LSTM structures, most of the previous works can only learn the temporal interdependency between the holistic frames rather than more fine-grained object-level motion patterns. Some recent approaches develop more complicated recurrent network structures. For instance, structural-RNN develops a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture. A more recent Graph LSTM defined over a pre-defined graph topology enables the inference for more complex structured data. However, both of them require a pre-fixed network structure for information propagation, which is impractical for weakly-supervised/slightly-supervised object detection without the knowledge of object localizations and precise object class labels. To handle the propagation over dynamically specified graph structures, we thus propose a new temporal dynamic network structure that supports the inference over the constantly changing graph topologies in different training steps.

The proposed TD-Graph LSTM

Figure 2 gives an overview of our TD-Graph LSTM. Each frame in the input video is first passed through a spatial ConvNet to obtain spatial visual features for region proposals. Based on visual features, similar regions in two consecutive frames are discovered and associated to indicate the same object across the temporal domain. A temporal graph structure is constructed by connecting all of the semantically similar regions in two consecutive frames, where graph nodes are represented by region proposals. The TD-Graph LSTM unit is then employed to recurrently propagate information over the whole temporal graph, where LSTM units take the spatial visual features as the input states. Benefiting from the graph topology, TD-Graph LSTM is capable of incorporating temporal motion patterns for participating objects in the action in a more efficient and meaningful way. TD-Graph LSTM outputs the enhanced temporal-aware features of all regions. Region-level classification is then employed to produce classification confidences. These region-level predictions can finally be aggregated to generate frame-level object class prediction, supervised by the object classes from action labels. The action-driven object categorization loss thus enables the holistic back-propagation into all regions in the video, where the prediction of each frame can mutually benefit from each other.

The proposed TD-Graph LSTM comprises three parametrized modules: a spatial ConvNet $\Phi(\cdot)$ for visual feature extraction, a TD-Graph LSTM unit $\Psi(\cdot)$ for recurrent temporal information propagation, and a region-level classification module $\varphi(\cdot)$. These three modules are iteratively updated, targeted at action-driven object detection.

At each model updating step $t$, a temporal graph structure $\mathcal{G}^{t}=\langle\mathbf{V},\mathcal{E}^{t}\rangle$ for each video is constructed based on the updated spatial visual features $\mathbf{f}^{t}$ of all regions $\mathbf{r}$ in the videos, defined as $\mathcal{G}^{t}=\beta(\Phi^{t}(\mathbf{r}))$. Here $\beta(\cdot)$ is a function that calculates the dynamic edge connections $\mathcal{E}^{t}$ conditioned on the updated visual features $\mathbf{f}^{t}=\Phi^{t}(\mathbf{r})$. The TD-Graph LSTM unit $\Psi^{t}$ recurrently operates on the visual features $\mathbf{f}^{t}$ of all frames and propagates temporal information over the graph $\mathcal{G}^{t}$ to obtain the enhanced temporal-aware features $\hat{\mathbf{f}}^{t}=\Psi^{t}(\mathbf{f}^{t}|\mathcal{G}^{t})$ of all regions in the video. Based on the enhanced $\hat{\mathbf{f}}^{t}$, the region-level classification module $\varphi$ produces classification confidences $\mathbf{rc}^{t}$ for all regions, as $\mathbf{rc}^{t}=\varphi(\hat{\mathbf{f}}^{t})$. These region-level category confidences $\mathbf{rc}^{t}$ can be aggregated to produce frame-level category confidences $\mathbf{pc}^{t}=\gamma(\mathbf{rc}^{t})$ of all frames by summing the category confidences of all regions of each frame.

During training, we define the action-driven loss for each frame as a hinge loss function and train a multi-label image classification objective for all frames in the videos:

where $C$ is the number of classes and $y_{c,i}$, $i\in\{1,\dots,N\}$, represents the action-driven object labels for each frame. For each frame $I_{i}$, $y_{c,i}=1$ only if the action-driven object label $c$ is assigned to the frame $I_{i}$, and $y_{c,i}=-1$ otherwise. The objective function defined in Eq. 1 can be optimized by Stochastic Gradient Descent (SGD) with back-propagation. At the $t$-th gradient update, the temporal graph structure $\mathcal{G}^{t}$ is accordingly updated by $\beta(\Phi^{t}(\mathbf{r}))$ for each video. Thus, the TD-Graph LSTM unit optimizes over a dynamically updated graph structure $\mathcal{G}^{t}$. In the following sections, we introduce the above-defined parametrized modules.
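Eq. 1 itself is not reproduced in this text. As a rough illustration only, the NumPy sketch below sums region-level confidences into frame-level confidences and applies a per-class hinge loss with labels in $\{+1,-1\}$. The margin of 1, the averaging over frames and classes, and the function names are assumptions made for this sketch, not details confirmed by the text.

```python
import numpy as np

def frame_confidences(region_conf):
    """Sum region-level confidences over the regions of each frame.

    region_conf: array of shape (N, M, C) for N frames, M regions, C classes.
    Returns an array of shape (N, C).
    """
    return region_conf.sum(axis=1)

def action_driven_hinge_loss(frame_conf, labels):
    """Multi-label hinge loss; labels are +1 (class assigned) or -1 (otherwise).

    Assumption: margin of 1 and a mean over all frames and classes.
    """
    return np.maximum(0.0, 1.0 - labels * frame_conf).mean()

# Toy example: 2 frames, 3 regions per frame, 2 classes.
rc = np.array([[[0.5, -0.2], [0.7, -0.3], [0.3, -0.5]],
               [[-0.4, 0.1], [-0.6, 0.2], [0.0, 0.2]]])
y = np.array([[1, -1], [-1, 1]])
pc = frame_confidences(rc)
loss = action_driven_hinge_loss(pc, y)
```

In this toy case only the second class of the second frame falls inside the margin, so it is the only term that contributes to the loss.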

Spatial ConvNet

TD-Graph LSTM Unit

Dynamic Graph Updating. Given the updated visual features $\mathbf{f}^{t}_{i}$ of each frame $I_{i}$, the temporal graph structure $\mathcal{G}^{t}=\langle\mathbf{V},\mathcal{E}^{t}\rangle$ can be accordingly constructed by learning the dynamic edge connections $\mathcal{E}^{t}$. The node set $\mathbf{V}=\{v_{i,j}\}$, $j\in\{1,\dots,M\}$, is represented by the visual features $\{\mathbf{f}^{t}_{i,j}\}$ of all regions in all frames; that is, $M\times N$ nodes for $M$ region proposals in each of $N$ frames. Each node $v_{i,j}$ is connected with nodes in the preceding frame $I_{i-1}$ and nodes in the subsequent frame $I_{i+1}$. To incorporate the motion dependency in consecutive frames, the edge connections $\mathcal{E}^{t}_{i,i-1}$ between nodes in $I_{i}$ and $I_{i-1}$ are mined by considering their appearance similarities in visual features. Specifically, the edge weight between each pair of nodes $(v_{i,j},v_{i-1,j^{\prime}})$ is first calculated as $\frac{1}{2}\exp(-\|\mathbf{f}^{t}_{i,j}-\mathbf{f}^{t}_{i-1,j^{\prime}}\|_{2})$. To make the model inference efficient and alleviate the missing label issue, each node $v_{i,j}$ is only connected to the $K$ nodes $v_{i-1,j^{\prime}}$ with the top-$K$ highest edge weights in the preceding frame $I_{i-1}$, and these activated edge weights are normalized to sum to 1. We denote the normalized edge weight as $\omega^{t}_{i,i-1,j,j^{\prime}}$. Thus, the updated temporal graph structure $\mathcal{G}^{t}$ can be regarded as an undirected $K$-neighbor graph where each node $v_{i,j}$ is connected with at most $K$ nodes in the preceding frame.
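The edge construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dynamic_edges`, the array layout, and the use of `np.argpartition` for the top-$K$ selection are choices made here, and $K$ is assumed not to exceed the number of regions in the preceding frame.

```python
import numpy as np

def dynamic_edges(feat_cur, feat_prev, k):
    """Top-k normalized edge weights from frame I_i to frame I_{i-1}.

    feat_cur:  (M, D) features of the regions in frame I_i.
    feat_prev: (M', D) features of the regions in frame I_{i-1}.
    Returns (idx, w): idx[j] lists the k selected previous-frame regions
    for region j, and w[j] holds their weights, normalized to sum to 1.
    """
    # Pairwise Euclidean distances, shape (M, M').
    diff = feat_cur[:, None, :] - feat_prev[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    weights = 0.5 * np.exp(-dist)
    # Indices of the k largest weights per row (order within the k is arbitrary).
    idx = np.argpartition(-weights, k - 1, axis=1)[:, :k]
    top = np.take_along_axis(weights, idx, axis=1)
    w = top / top.sum(axis=1, keepdims=True)
    return idx, w

# Toy example: 3 regions in the current frame, 5 in the preceding frame.
rng = np.random.default_rng(0)
f_prev = rng.normal(size=(5, 4))
f_cur = rng.normal(size=(3, 4))
idx, w = dynamic_edges(f_cur, f_prev, k=2)
```

Because the features change after every model update, calling such a function again at each step yields a different set of active edges, which is what makes the graph dynamic. With very large feature distances the exponential can underflow to zero, so a practical version would likely need a guard before the normalization.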

TD-Graph LSTM. The TD-Graph LSTM layer propagates temporal context over the graph and recurrently updates the hidden states $\{\mathbf{h}_{i,j}^{t}\}$ of all regions in each frame $I_{i}$ to construct enhanced temporal-aware feature representations. These features are fed into the region-level classification module to compute the category-level confidences of each region. TD-Graph LSTM updates the hidden states of frame $I_{i}$ by incorporating information from the frame-level hidden state $\bar{\mathbf{h}}_{i-1}^{t}$ and memory state $\bar{\mathbf{m}}_{i-1}^{t}$. The use of shared frame-level hidden and memory states provides a compact memorization of temporal patterns in the previous frame and is more suitable for massive and possibly missing graph nodes (e.g., 500 in our setting) in a large temporal graph. After performing $N$ updating steps for all frames, our model effectively embeds the rich temporal dependency to obtain the enhanced temporal-aware feature representations of all regions in all frames. For updating the features of each node $v_{i,j}$ in the frame $I_{i}$, the TD-Graph LSTM unit takes as input its own visual features $\mathbf{f}_{i,j}^{t}$, temporal context features $\hat{\mathbf{f}}_{i,j}^{t}$, frame-level hidden states $\bar{\mathbf{h}}_{i-1}^{t}$ and memory states $\bar{\mathbf{m}}_{i-1}^{t}$, and outputs the new hidden states $\mathbf{h}_{i,j}^{t}$. Given the dynamic edge connections $e_{i,j}=\{\langle v_{i,j},v_{i-1,j^{\prime}}\rangle\},j^{\prime}\in\mathcal{N}_{\mathcal{G}}(v_{i,j})$, the temporal context features $\hat{\mathbf{f}}_{i,j}^{t}$ can be calculated by performing a weighted summation of features of connected regions:
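The corresponding equation is not reproduced in this text. Reading the description literally, and using the normalized edge weights $\omega^{t}_{i,i-1,j,j^{\prime}}$ from the dynamic graph updating step, the weighted summation would take the following form (a reconstruction from the surrounding definitions, not a quotation):

$$\hat{\mathbf{f}}_{i,j}^{t}=\sum_{j^{\prime}\in\mathcal{N}_{\mathcal{G}}(v_{i,j})}\omega^{t}_{i,i-1,j,j^{\prime}}\,\mathbf{f}^{t}_{i-1,j^{\prime}}$$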

And the shared frame-level hidden states $\bar{\mathbf{h}}_{i-1}^{t}$ and memory states $\bar{\mathbf{m}}_{i-1}^{t}$ can be computed as
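The equation is not reproduced in this text. One natural choice, stated here as an assumption rather than as the authors' definition, is to average the region-level states of frame $I_{i-1}$ over its $M$ regions:

$$\bar{\mathbf{h}}_{i-1}^{t}=\frac{1}{M}\sum_{j=1}^{M}\mathbf{h}_{i-1,j}^{t},\qquad \bar{\mathbf{m}}_{i-1}^{t}=\frac{1}{M}\sum_{j=1}^{M}\mathbf{m}_{i-1,j}^{t}$$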

The TD-Graph LSTM unit consists of four gates for each node $v_{i,j}$: the input gate $\mathbf{gu}^{t}_{i,j}$, the forget gate $\mathbf{gf}^{t}_{i,j}$, the memory gate $\mathbf{gc}^{t}_{i,j}$, and the output gate $\mathbf{go}^{t}_{i,j}$. Here $W^{u}_{t},W^{f}_{t},W^{c}_{t},W^{o}_{t}$ are the recurrent gate weight matrices specified for input visual features, and $W^{ut}_{t},W^{ft}_{t},W^{ct}_{t},W^{ot}_{t}$ are those for temporal context features. $U^{u}_{t},U^{f}_{t},U^{c}_{t},U^{o}_{t}$ are the weight parameters specified for frame-level hidden states. The new hidden states and memory states in the graph $\mathcal{G}^{t}$ can be calculated as follows:
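The update equations are not reproduced in this text. A standard LSTM-style form that is consistent with the gates and weight matrices named above would be the following; the absence of bias terms, the use of $\tanh$ for the memory gate, and the exact way the frame-level memory $\bar{\mathbf{m}}_{i-1}^{t}$ enters the update are assumptions:

$$\begin{aligned}
\mathbf{gu}^{t}_{i,j}&=\delta\big(W^{u}_{t}\mathbf{f}^{t}_{i,j}+W^{ut}_{t}\hat{\mathbf{f}}^{t}_{i,j}+U^{u}_{t}\bar{\mathbf{h}}^{t}_{i-1}\big),\\
\mathbf{gf}^{t}_{i,j}&=\delta\big(W^{f}_{t}\mathbf{f}^{t}_{i,j}+W^{ft}_{t}\hat{\mathbf{f}}^{t}_{i,j}+U^{f}_{t}\bar{\mathbf{h}}^{t}_{i-1}\big),\\
\mathbf{go}^{t}_{i,j}&=\delta\big(W^{o}_{t}\mathbf{f}^{t}_{i,j}+W^{ot}_{t}\hat{\mathbf{f}}^{t}_{i,j}+U^{o}_{t}\bar{\mathbf{h}}^{t}_{i-1}\big),\\
\mathbf{gc}^{t}_{i,j}&=\tanh\big(W^{c}_{t}\mathbf{f}^{t}_{i,j}+W^{ct}_{t}\hat{\mathbf{f}}^{t}_{i,j}+U^{c}_{t}\bar{\mathbf{h}}^{t}_{i-1}\big),\\
\mathbf{m}^{t}_{i,j}&=\mathbf{gf}^{t}_{i,j}\odot\bar{\mathbf{m}}^{t}_{i-1}+\mathbf{gu}^{t}_{i,j}\odot\mathbf{gc}^{t}_{i,j},\\
\mathbf{h}^{t}_{i,j}&=\mathbf{go}^{t}_{i,j}\odot\tanh\big(\mathbf{m}^{t}_{i,j}\big).
\end{aligned}$$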

Here $\delta$ is a logistic sigmoid function, and $\odot$ indicates a point-wise product. Given the updated hidden states $\{\mathbf{h}^{t}_{i,j}\}$ and memory states $\{\mathbf{m}^{t}_{i,j}\}$ of all regions in frame $I_{i}$, we can obtain new frame-level hidden states $\bar{\mathbf{h}}_{i}^{t}$ and memory states $\bar{\mathbf{m}}_{i}^{t}$ for updating the states of regions in frame $I_{i+1}$. The TD-Graph LSTM unit recurrently updates the states of all regions in each frame, and thus the past temporal information in preceding frames can be utilized for updating each frame. The TD-Graph LSTM layer is illustrated in Figure 3.
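To make the frame-by-frame recurrence concrete, the following NumPy sketch runs an LSTM-style update over the regions of each frame and carries shared frame-level states forward. It is an illustration under several assumptions that the text does not confirm: no bias terms, frame-level states computed as the mean over regions, zero initial states, and a placeholder context feature (the mean of the preceding frame's features) in place of the top-$K$ weighted sum. The function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def td_graph_lstm_frame(f, f_ctx, h_bar, m_bar, params):
    """Update all M regions of one frame.

    f:      (M, D) visual features of the frame's regions.
    f_ctx:  (M, D) temporal context features of the same regions.
    h_bar, m_bar: (D,) frame-level states of the preceding frame.
    params: dict mapping gate name ('u', 'f', 'c', 'o') to a tuple
            (W, Wt, U) of (D, D) matrices (hypothetical layout).
    """
    pre = {g: f @ W + f_ctx @ Wt + h_bar @ U for g, (W, Wt, U) in params.items()}
    gu, gf, go = sigmoid(pre['u']), sigmoid(pre['f']), sigmoid(pre['o'])
    gc = np.tanh(pre['c'])
    m = gf * m_bar + gu * gc   # (M, D) region memory states
    h = go * np.tanh(m)        # (M, D) region hidden states
    # Assumption: frame-level states are the mean over the frame's regions.
    return h, m, h.mean(axis=0), m.mean(axis=0)

rng = np.random.default_rng(0)
N, M, D = 4, 3, 5
params = {g: tuple(rng.uniform(-0.1, 0.1, size=(D, D)) for _ in range(3))
          for g in 'ufco'}
feats = rng.normal(size=(N, M, D))
h_bar, m_bar = np.zeros(D), np.zeros(D)  # assumed zero initial states
hidden = []
for i in range(N):
    # Placeholder context: zeros for the first frame, otherwise the mean
    # feature of the preceding frame (uniform weights, not top-K weights).
    if i == 0:
        ctx = np.zeros((M, D))
    else:
        ctx = np.tile(feats[i - 1].mean(axis=0), (M, 1))
    h, m, h_bar, m_bar = td_graph_lstm_frame(feats[i], ctx, h_bar, m_bar, params)
    hidden.append(h)
hidden = np.stack(hidden)  # (N, M, D) temporal-aware features
```

Sharing one hidden and one memory vector per frame keeps the recurrent state size independent of the number of regions, which matches the motivation given for handling hundreds of proposals per frame.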

Region-level Classification Module

Experiments

Dataset Analysis. We evaluate the action-driven weakly-supervised object detection performance on the Charades dataset. The Charades video dataset is composed of daily indoor activities collected through Amazon Mechanical Turk. There are 157 action classes and on average 6.8 actions in each video, which occur in various orders and contexts. In order to detect objects in videos by using action labels, we only consider the action labels that are related to objects for training. Therefore, there are 66 action labels that are related to 17 object classes in our experiments. We show the distribution of object classes (in a random subset of videos) in Figure 5 (a). The training set contains 7,542 videos. Videos are down-sampled to 1 fps and we only sample the frames assigned with action labels in each video. During training, only frame-level action labels are provided for each video.

In order to evaluate the video object detection performance over 17 daily object classes, we collect the bounding box annotations for 5,000 test frames from 200 videos in the Charades test set. The distribution of the number of bounding boxes per frame is shown in Figure 5 (b), ranging from 1 to 23 boxes per frame. More than 60% of the frames have more than 4 bounding boxes, and most video frames exhibit severe motion blur and low resolution. This poses more challenges for the object detection model compared to an image-based object detection dataset, such as the most popular PASCAL VOC that is widely used in existing weakly-supervised object detection methods. Figure 4 further shows example frames with action labels on the Charades dataset. It can be seen that each action label only provides one piece of object class information for a frame that may contain several object classes, which can be regarded as the missing label issue for training a model under this action-driven setting. Moreover, the video frames often appear with a very cluttered background, blurry objects and diverse viewpoints, which are more challenging and realistic compared to existing image datasets (e.g., MS COCO and ImageNet) and video datasets (e.g., UCF101).

Evaluation Measures. We evaluate the performance of both object detection and image classification tasks on Charades. For detection, we report the average precision (AP) at 50% intersection-over-union (IOU) of the detected boxes with the ground truth boxes. For classification, we also report the AP on frame-level object classification.
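For reference, the 50% IoU criterion used for detection can be sketched as below. The box format `(x1, y1, x2, y2)` in continuous coordinates, the inclusive threshold, and the function names are assumptions of this sketch; the exact evaluation protocol is not spelled out in the text.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(detected, ground_truth, threshold=0.5):
    """A detection counts as correct if its IoU reaches the threshold (assumed inclusive)."""
    return iou(detected, ground_truth) >= threshold

# Two 10x10 boxes shifted by half their width overlap with IoU 1/3,
# so the shifted box would not count as a correct detection.
half_shift = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Average precision is then computed per class from the ranked list of detections labeled this way, and mAP is the mean over the object classes.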

Implementation Details

Our TD-Graph LSTM adopts the VGG-CNN-F model pre-trained on ImageNet ILSVRC 2012 challenge data as the base model, and replaces the last pooling layer $pool5$ with an SPP layer to be compatible with the first fully connected layer. We use the EdgeBoxes algorithm to generate the top 500 regions that have width and height larger than 20 pixels as candidate regions for each frame. To balance performance and time cost, we set the number of edges linked to each node, $K$, to 100. For training, we use stochastic gradient descent with momentum 0.9 and weight decay $5\times 10^{-4}$. All weight matrices used in the TD-Graph LSTM units are randomly initialized from a uniform distribution over $[-0.1,0.1]$. TD-Graph LSTM predicts the hidden and memory states with the same dimension as the previous region-level CNN features. Each mini-batch contains at most 6 consecutive sampled frames in a video. The network is trained on the Charades training set by fine-tuning all layers, including those of the pre-trained base CNN model. The experiments are run for 30 epochs until the model converges. The learning rate is set to $10^{-5}$ for the first ten epochs and then decreased to $10^{-6}$. All our models are implemented on the public Torch platform, and all experiments are conducted on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB memory. The runtime is 2.5 fps and 3.9 fps for training and testing, respectively.

Results and Comparisons

We compare the proposed TD-Graph LSTM model with two state-of-the-art weakly-supervised learning methods on the Charades dataset, WSDDN and ContextLocNet. As both methods were proposed for image-based weakly-supervised object detection, we run the source code of ContextLocNet and its reproduced WSDDN (https://github.com/vadimkantorov/contextlocnet) on the Charades dataset to make a fair comparison with our method. Their models are trained by treating the action-related object labels in each frame as the supervision information and are evaluated on each video frame. The difference between our model and WSDDN is our usage of TD-Graph LSTM layers to leverage rich temporal correlations in the whole video. Similar to WSDDN, ContextLocNet is also a two-stream model, with an enhanced localization module that uses various surrounding context. Specifically, we use the contrastive-S setup of ContextLocNet. All of these models use the same base model and region proposal method, i.e., the VGG-CNN-F model and EdgeBoxes.

We report the comparisons with the two state-of-the-art methods on classification mAP and detection mAP in Table 1 and Table 2, respectively. It can be observed that our TD-Graph LSTM model substantially outperforms the two baselines on both classification mAP and detection mAP; in particular, it is 3.05% higher than ContextLocNet and 3.85% higher than WSDDN in terms of classification mAP. Notably, our TD-Graph LSTM surpasses the two baselines on small objects, e.g., by over 14.13% for the pillow class and 6.93% for the cup class. Although our model and the two baselines all obtain low detection mAP under this challenging setting, our TD-Graph LSTM still surpasses the two baselines on detecting crowded and small objects in the video. The superiority of our TD-Graph LSTM demonstrates its effectiveness in challenging action-driven weakly-supervised object detection, where the missing label issue is quite severe and a considerable number of bounding boxes appear in each frame with very low quality. We further show the qualitative comparison with the two state-of-the-art methods in Figure 7. Our model is able to produce more precise object detections even for very small objects (e.g., the cup in the middle row) and objects with heavy occlusion (e.g., the sofa in the bottom row). Our TD-Graph LSTM exploits complex temporal correlations between region proposals by propagating knowledge over a whole dynamic temporal graph, which effectively alleviates the critical missing label issue, as shown in Figure 6.

Ablation Study

The results of model variants are reported in Table 1, Table 2 and Table 3.

The effectiveness of incorporating graph. The main difference between our TD-Graph LSTM and a conventional LSTM structure for sequential modeling is in propagating information over a dynamic graph structure. To verify its effectiveness, we thus compare our full model with the variant “TD-Graph LSTM w/o graph” that eliminates the edge connections between regions in consecutive frames, and updates the frame-level hidden and memory states with the original region-level features. Our TD-Graph LSTM consistently obtains better results than “TD-Graph LSTM w/o graph”, which speaks to the advantage of incorporating a graph for the challenging action-driven object detection.

The effectiveness of temporal LSTM. We further verify that recurrent sequential modeling by the LSTM units over the temporal graph is beneficial for exploiting complex object motion patterns in daily videos. “TD-Graph LSTM w/o LSTM” indicates removing the LSTM units and directly aggregating the temporal context features to enhance features of each region. The performance gap between our full model and “TD-Graph LSTM w/o LSTM” verifies the benefits of adopting LSTM.

Dynamic graph vs Static graph vs Mean graph. Besides the proposed dynamic graph, another commonly used alternative is the fully-connected graph where each region is densely connected with all regions in the preceding frame; that is, “Ours w/ Static Graph” and “Ours w/ Mean Graph”. “Ours w/ Static Graph” uses the adaptive edge weights similar to TD-Graph LSTM while “Ours w/ Mean Graph” uses the same weights for all edge connections. It can be seen that applying a dynamic graph structure can help significantly boost both detection and classification performance over other fully-connected graphs. The reason is that meaningful temporal correlations between regions can be discovered by the dynamic graph and leveraged to transfer motion context into the whole video.

Conclusion

In this paper, we propose a novel temporal dynamic graph LSTM architecture to address action-driven weakly-supervised object detection. It recurrently propagates the temporal context on a constructed dynamic graph structure for each frame. The global action knowledge in the whole video can be effectively leveraged for object detection in each frame, which helps alleviate the missing label problem. Extensive experiments on Charades, a large-scale daily-life action dataset, demonstrate the superiority of our model over the state of the art.

Acknowledgements: This work was supported by ONR MURI N000141612007 and Sloan Fellowship to AG. XL was supported by the Department of Defense under Contract No. FA8702-15-D-0002 with CMU for the operation of the Software Engineering Institute, a federally funded research and development center.

References

Supplementary Materials

We visualize more detection results of the TD-Graph LSTM on the Charades dataset.