CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, Xiaoou Tang

Introduction

In the past several years, advances in deep learning techniques have given rise to a new wave of efforts towards vision-based action understanding. A number of deep learning based frameworks, including two-stream CNNs, 3D CNNs (C3D), and Trajectory-pooled Deep-convolutional Descriptors (TDD), have been developed and have significantly pushed forward the state of the art. This improvement in performance is, to a large extent, owing to both the modeling capacity of deep architectures and more effective learning strategies.

However, it is worth noting that previous efforts focus mainly on the analysis of short video clips. These clips are typically extracted from longer videos such that they only contain the portions of frames that truly capture the actions of interest. Obviously, preparation of such data is a laborious procedure. Action recognition from untrimmed videos, a problem that is more pertinent to real-world demands, is drawing increasing attention from the community. While substantially reducing the efforts needed in manual annotation, this task on the other hand presents a new challenge to the recognition system – a significant (or even dominant) fraction of a given video is irrelevant to the action of interest.

Driven by the ActivityNet benchmark, we develop an integrated approach to recognizing actions from untrimmed videos. (Code and models are available at https://github.com/yjxiong/anet2016-cuhk.) Our approach follows the framework of temporal segment networks presented in our earlier paper, which allows modeling long-range temporal structure in actions and introduces various techniques to improve the training procedure, e.g., temporal pre-training and scale jittering augmentation. On top of this framework, we develop several new techniques to further improve the recognition accuracy. While visual analysis plays the primary role in this task, we notice that the audio channels that come with these videos provide complementary information. To exploit such information, we develop a deep network called Audio CNN to derive complementary features from the spectrograms.

Combining the visual and acoustic models, we attain a high recognition accuracy (mAP of $93.23\%$ on the testing set). We emphasize that this performance is obtained using only the training data provided by the ActivityNet benchmark, apart from CNNs pre-trained on ILSVRC12 data for initialization; no additional data or annotations are used in either training or testing.

The rest of this paper is organized as follows. Section 2 presents our approach in detail, Section 3 reports our results under a variety of settings, and Section 4 concludes this work.

Our Approach

Our approach to untrimmed video classification comprises two complementary components: visual and acoustic modeling. The visual analysis, which combines a variety of techniques, plays the primary role in this framework, while the acoustic model exploits complementary information from the audio channels to further improve performance. We present these components in Sections 2.1 and 2.2, respectively.

Our visual analysis component works as follows: it samples multiple snippets from a given video, makes snippet-wise predictions using very deep two-stream CNNs, and finally aggregates the predictions via different strategies such as top-k and attention-weighted pooling.

Deep convolutional neural networks (CNNs) that learn from multiple modalities of input data have been used extensively in visual recognition tasks and have achieved superior performance over models using a single modality. The snippet-wise predictor in our approach is a realization of the temporal segment network framework, which consists of appearance and motion modeling parts. In this work, we adopt recently proposed network architectures such as ResNet and Inception V3 to improve the capacity of the snippet-wise predictor.

During training of the snippet-wise predictor, the techniques introduced with the temporal segment network framework, such as scale jittering and stronger dropout, are also applied to these architectures. The basic idea of temporal segment networks is to sample several snippets from one input video and to jointly train the CNNs by averaging the per-snippet predictions. We also experimented with incorporating more advanced aggregation techniques into the training process.
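The sampling-and-averaging idea described above can be illustrated with a minimal sketch. This is not the released implementation: the function names, the equal-length segment split, and the default of three segments are assumptions made only for illustration.

```python
import random


def sample_snippet_indices(num_frames, num_segments=3, rng=random):
    """Split [0, num_frames) into equal segments and draw one frame index per segment.

    num_segments=3 is an illustrative default, not a value stated in the text.
    """
    if num_frames < num_segments:
        raise ValueError("video is shorter than the number of segments")
    # Integer segment boundaries; every segment is guaranteed to be non-empty.
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def average_consensus(snippet_scores):
    """Average per-snippet class scores into a single video-level score vector."""
    num_snippets = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    return [
        sum(scores[c] for scores in snippet_scores) / num_snippets
        for c in range(num_classes)
    ]
```

During training, the loss would be computed on the averaged scores, so that the gradient reaches every sampled snippet of the same video.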

Video-level Classification

To obtain video-level classification results, we use the following strategy: the snippet-wise predictor is first applied to snippets sampled from the input video at a rate of $1$ FPS, and an aggregation module then combines the snippet-wise class scores into the final prediction. We experimented with several advanced strategies for combining the snippet-wise scores of the appearance nets, including top-$k$ pooling and attention-weighted pooling. These strategies, when used in both training and testing, produced models that are complementary to each other and thus form effective components in the final ensemble.
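The two aggregation strategies can be sketched as follows, assuming the snippet-wise scores are given as a (snippets x classes) array. The function names are hypothetical, and the attention logits are taken as an input here, whereas in a trained model they would be produced by a learned module that the text does not specify.

```python
import numpy as np


def top_k_pooling(scores, k):
    """Average the k largest snippet scores of each class.

    scores: array of shape (num_snippets, num_classes).
    """
    scores = np.asarray(scores, dtype=float)
    k = max(1, min(k, scores.shape[0]))
    top = np.sort(scores, axis=0)[-k:]  # k largest values per class column
    return top.mean(axis=0)


def attention_weighted_pooling(scores, attention_logits):
    """Weighted sum of snippet scores with softmax-normalized attention weights.

    attention_logits: one scalar per snippet (assumed to come from a learned module).
    """
    scores = np.asarray(scores, dtype=float)
    logits = np.asarray(attention_logits, dtype=float)
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ scores
```

With k equal to the number of snippets, top-$k$ pooling reduces to plain averaging, and with uniform attention logits the attention-weighted pooling does the same, so both can be read as generalizations of the average consensus.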

Acoustic Analysis System

Audio signals in a video carry important cues for recognizing some action classes. To harness the information in this aspect, we combine the standard MFCC representations with audio-based CNNs to form the acoustic modeling system.

Mel Frequency Cepstral Coefficients (MFCCs) are a powerful feature descriptor widely used in automatic speech recognition systems. In our approach, we extract MFCC features from the audio tracks accompanying the videos in the dataset and train SVMs on the descriptors aggregated with Fisher Vectors.
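The Fisher Vector aggregation step can be sketched as below. This is an illustrative sketch under several assumptions: the diagonal-covariance GMM parameters are taken as given (fitting the GMM, extracting MFCCs, and training the SVMs are omitted), the function name is hypothetical, and the power and L2 normalization follow the commonly used improved Fisher Vector, which the text does not confirm.

```python
import numpy as np


def fisher_vector(descriptors, weights, means, variances):
    """Encode (T, D) descriptors with a K-component diagonal GMM into a 2*K*D vector."""
    x = np.asarray(descriptors, dtype=float)   # (T, D), e.g. one MFCC vector per frame
    w = np.asarray(weights, dtype=float)       # (K,)  mixture weights
    mu = np.asarray(means, dtype=float)        # (K, D)
    var = np.asarray(variances, dtype=float)   # (K, D) diagonal variances
    sigma = np.sqrt(var)

    # Normalized deviations and posterior responsibilities gamma[t, k].
    diff = (x[:, None, :] - mu[None, :, :]) / sigma[None, :, :]      # (T, K, D)
    log_prob = (-0.5 * (diff ** 2).sum(-1)
                - 0.5 * np.log(2 * np.pi * var).sum(-1)
                + np.log(w))                                         # (T, K)
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Gradients with respect to the means and standard deviations.
    num = x.shape[0]
    grad_mu = (gamma[:, :, None] * diff).sum(0) / (num * np.sqrt(w)[:, None])
    grad_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (num * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([grad_mu.ravel(), grad_sigma.ravel()])

    # Power and L2 normalization (an assumption, see the note above).
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

Each audio track is thus mapped to one fixed-length vector regardless of its duration, which is what makes a linear SVM applicable on top.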

Audio CNN

The basic idea of the Audio CNN is to apply CNNs to spectrograms, i.e., time-frequency response maps, of audio signals. In this work, we propose to directly use the grayscale time-frequency map image to train the audio CNN. The audio CNN can then be initialized with the same technique used for the temporal networks. It is also known that learning from multiple time scales helps acoustic models. Accordingly, we propose to stack multiple spectrograms computed with varying window sizes as the input to the audio CNN.
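One way to build such a multi-scale input is sketched below. The window sizes, hop length, log compression, and zero-padding scheme are all assumptions chosen so that the spectrograms share one shape and can be stacked as channels; the text does not specify these details, and the function names are hypothetical.

```python
import numpy as np


def spectrogram(signal, window_size, hop, n_fft):
    """Log-magnitude STFT of a 1-D signal, shape (n_fft // 2 + 1, num_frames).

    window_size must be even so that all scales yield the same number of frames.
    """
    pad = window_size // 2
    padded = np.pad(np.asarray(signal, dtype=float), pad)  # zero-pad both ends
    window = np.hanning(window_size)
    starts = range(0, len(padded) - window_size + 1, hop)
    frames = np.stack([padded[s:s + window_size] * window for s in starts])
    # Zero-pad every frame to n_fft so all scales share the same frequency axis.
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log1p(magnitude).T


def multi_scale_spectrogram(signal, window_sizes=(256, 512, 1024), hop=128):
    """Stack spectrograms computed with different window sizes as input channels."""
    n_fft = max(window_sizes)
    return np.stack([spectrogram(signal, w, hop, n_fft) for w in window_sizes])
```

Short windows give finer time resolution and long windows give finer frequency resolution, so the stacked channels present the same audio at several time scales to the network.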

Experiments

We train our models on the official training set of the ActivityNet v$1.3$ dataset. There are $10,024$ videos for training, containing $15,410$ activity instances from $200$ activity classes. The validation set contains $4,926$ videos and $7,654$ activity instances. We study the performance of our approach on this validation set. The final testing set comprises $5,044$ videos and is not annotated with any activity instance. We report the performance of our proposed models on this set according to the feedback of the challenge's test server. Models for this setting are trained on the union of the training and validation sets.

In the experiments, we compare the performance of temporal segment networks using several network architectures, including BN-Inception, Inception V3, and ResNet. The performance of the different network structures for the spatial and temporal streams is summarized in Table 1. To analyze the effect of different training strategies, we compare the performance of appearance modeling CNNs trained with these strategies. The results are presented in Table 2. The contributions of the appearance and motion CNNs are summarized in Table 3. We then report the performance of the two components of the acoustic analysis system in Table 4.

Finally, we evaluate the fusion of the visual and acoustic analysis systems on both the validation and testing sets. The results are reported in Table 5. The best mAP achieved by the final ensemble is $93.2\%$. We also used one submission to the testing server to evaluate a combination of one appearance CNN and one motion CNN. Its results are presented as “Visual CNN (Single)” in Table 5. Even in this “single model” setting we still achieve an mAP of $91.2\%$, which may be a better fit for industrial applications.
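For readers who wish to reproduce this kind of evaluation, the sketch below shows one plausible form of score-level fusion and a plain mAP computation. The weighted-average fusion rule and the `audio_weight` value are assumptions, since the text does not state how the two systems are combined, and the non-interpolated average precision used here is not necessarily identical to the official challenge evaluation code.

```python
import numpy as np


def fuse_scores(visual, audio, audio_weight=0.2):
    """Weighted average of visual and acoustic class scores (hypothetical weighting)."""
    visual = np.asarray(visual, dtype=float)
    audio = np.asarray(audio, dtype=float)
    return (1.0 - audio_weight) * visual + audio_weight * audio


def average_precision(scores, labels):
    """Non-interpolated AP for one class: mean precision at the rank of each positive."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(labels, dtype=bool)[order]
    if not hits.any():
        return float("nan")  # class has no positive video
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_rank[hits].mean())


def mean_average_precision(score_matrix, label_matrix):
    """Mean of per-class APs; both inputs have shape (num_videos, num_classes)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix)
    aps = [
        average_precision(score_matrix[:, c], label_matrix[:, c])
        for c in range(score_matrix.shape[1])
    ]
    return float(np.nanmean(aps))
```

In practice the fusion weight would be chosen on the validation set before submitting predictions to the test server.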

Conclusions

This paper has proposed an action recognition method for classifying temporally untrimmed videos. It is based on the idea of combining visual analysis and acoustic analysis. The results show that, by carefully designing the visual and acoustic analysis systems and combining them, we can achieve strong results in video classification and improve on state-of-the-art methods. It is also worth noting that this high accuracy is achieved by evaluating only $1$ frame per second, equivalent to seeing only around $4\%$ of all frames of the input videos. We believe this property is also very important for the practical deployment of the system in industrial scenarios.

Acknowledgment

This work was supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626) and ERC Advanced Grant Varcity (No. 273940).