Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge

Gurkirt Singh, Fabio Cuzzolin

Introduction

Emerging real-world applications require an all-round approach to the machine understanding of human behaviour, which goes beyond the recognition of simple, isolated activities from video.

As a step towards this ambitious goal, in this work we address the problem of detecting the temporal bounds of activities in temporally untrimmed videos.

Methodology

Our approach has three components: (i) video-level features are used for the untrimmed video classification task; (ii) frame-level features are used for activity proposal generation and scoring; and (iii) a video’s classification score is augmented with the scores of the activity proposals to classify each proposal.

We make use of the features provided on ActivityNet’s web page (http://activity-net.org/challenges/2016/download.html).

ImageNetShuffle (INS) features are video-level features generated using a Google Inception network (GoogLeNet). CNN features are extracted from the pool5 layer of GoogLeNet at a rate of two frames per second. Frame-level CNN features are mean-pooled to construct a representation of the whole video. Mean pooling is followed by L1-normalisation.
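The pooling step described above can be illustrated with a minimal sketch. It assumes the per-frame features are plain lists of floats; the function name `mean_pool_l1` and the toy values are hypothetical and are not taken from the authors' code.

```python
def mean_pool_l1(frame_features):
    """Mean-pool per-frame feature vectors, then L1-normalise the result."""
    if not frame_features:
        raise ValueError("need at least one frame")
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    # Average each feature dimension over all frames of the video.
    pooled = [sum(f[d] for f in frame_features) / n_frames for d in range(dim)]
    # L1 norm: sum of absolute values of the pooled vector.
    l1 = sum(abs(v) for v in pooled)
    # Guard against an all-zero vector (left unchanged in that case).
    return [v / l1 for v in pooled] if l1 > 0 else pooled


# Toy example: two frames with three-dimensional features (illustrative values).
frames = [[1.0, 3.0, 0.0], [3.0, 1.0, 4.0]]
video_descriptor = mean_pool_l1(frames)
print(video_descriptor)
```

After normalisation the absolute values of the descriptor sum to one, so videos of different lengths yield descriptors on a comparable scale.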

We train a one-versus-rest linear SVM for each class, and use the resulting SVM scores $S^{i}=\{s_{1}^{i},...,s_{c}^{i},...,s_{C}^{i}\}$, where $C$ is the number of classes, as INS features.

Motion Boundary Histogram (MBH) features are generated with the aid of the improved trajectories executable (http://lear.inrialpes.fr/people/wang/improved_trajectories). We train another battery of one-versus-rest SVMs using a linear kernel on the MBH features, and use the resulting SVM scores $S^{m}=\{s_{1}^{m},...,s_{c}^{m},...,s_{C}^{m}\}$ as global video features.

2.1.2 Frame level features

C3D features are generated at 2 frames per second using a C3D network with a temporal resolution of 16 frames. Once again, we train a frame-level one-versus-rest SVM classifier for each activity class using a linear kernel. The score of frame $t$ is given by the resulting SVM scores: $S^{3}_{t}=\{s_{1}^{3},...,s_{c}^{3},...,s_{C}^{3}\}$. Finally, we mean-pool these scores over the frames for each class to obtain another score vector $S^{3}$, which is used for video classification.

2.2 Untrimmed video classification

Untrimmed video classification is achieved by fusing all video-level scores using a linear SVM as a meta-classifier. The video-level scores ($S^{i}$, $S^{m}$ and $S^{3}$) are stacked to form a single score vector. A linear SVM is trained on the stacked scores of the training set, and evaluated on the validation and testing sets. The scores $S^{s}$ output by the meta SVM are normalised by dividing them by the sum of the top $k$ scores. The parameter $k$ was cross-validated on the validation set and set to 3; this normalisation improves the mean average precision.

We believe that, since SVM scores are not probabilities, normalisation by top $k$ scores is required to be able to compare them across all videos.
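As an illustration of the top-$k$ normalisation, the sketch below divides a video's class scores by the sum of its $k$ largest scores. The function name `normalise_by_top_k` and the example scores are hypothetical, and the sketch assumes that the sum of the top $k$ scores is non-zero; how the authors handle other cases is not stated in the text.

```python
def normalise_by_top_k(scores, k=3):
    """Divide every class score by the sum of the k largest scores."""
    # Sum of the k highest-scoring classes for this video.
    top_sum = sum(sorted(scores, reverse=True)[:k])
    if top_sum == 0:
        # Assumption of this sketch: a zero normaliser is treated as an error.
        raise ValueError("sum of top-k scores is zero; cannot normalise")
    return [s / top_sum for s in scores]


# Toy meta-SVM scores for a four-class problem (illustrative values only).
meta_scores = [2.0, 1.0, 1.0, -0.5]
normalised = normalise_by_top_k(meta_scores, k=3)
print(normalised)
```

After this step the top $k$ scores of every video sum to one, which puts the scores of different videos on a shared scale.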

2.3 Activity detection in untrimmed videos

Activity proposals are detected by (i) training a binary random forest (RF) classifier for each class on the frame-level C3D features, and (ii) casting activity proposal generation as an optimisation problem, which makes use of these binary decisions.

The binary RF classifies each frame into a negative (i.e. no activity taking place) or a positive bin (i.e. something is happening). The positive score of a frame $t$ is denoted by $s^{r}_{t}$ . Temporal trimming is then achieved by dynamic programming as follows.

2.3.2 Activity proposal generation

Given the frame-level scores $\{s^{r}_{t},t=1,...,T\}$ for a video of length $T$, we want to assign to each frame a binary label $l_{t}\in\{1,0\}$ (where zero represents the ‘background’ or ‘no-activity’ class), which maximises:

$E(L)=\sum_{t=1}^{T}s_{t}(l_{t})-\lambda\sum_{t=2}^{T}\psi_{l}(l_{t},l_{t-1}),$

where $s_{t}(l_{t})$ is the score of frame $t$ under label $l_{t}$ (with $s_{t}(1)=s^{r}_{t}$), $\lambda$ is a scalar parameter, and the pairwise potential $\psi_{l}$ is defined as:

$\psi_{l}(l_{t},l_{t-1})=0$ if $l_{t}=l_{t-1}$ , $\psi_{l}(l_{t},l_{t-1})=\alpha$ otherwise,

(where $\alpha$ is a parameter which we set by cross-validation). This penalises labellings $L=\{l_{1},...,l_{T}\}$ which are not smooth, thus enforcing a piecewise constant solution. Each contiguous sub-sequence of positively labelled frames forms an activity proposal (there can be as many proposals as there are activity instances). Each activity proposal is assigned a global score $S_{a}$ equal to the mean of the scores of its constituent frames. This optimisation problem can be solved efficiently by dynamic programming. It can easily be extended to simultaneous detection and classification.
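The labelling and proposal extraction steps can be sketched with a small Viterbi-style dynamic programme. This is an illustration under stated assumptions rather than the authors' implementation: the unary score of the background label is assumed to be $1-s^{r}_{t}$, and the function names, toy scores and parameter values are all hypothetical.

```python
def label_frames(scores, lam=1.0, alpha=1.0):
    """Binary labelling maximising unary scores minus lam * switching penalties.

    Assumption of this sketch: label 1 scores s and label 0 scores 1 - s.
    """
    T = len(scores)
    if T == 0:
        return []
    unary = [(1.0 - s, s) for s in scores]  # unary[t][label]
    best = [list(unary[0])]                 # best[t][label]: best energy ending in label
    back = []                               # back[t-1][label]: best previous label
    for t in range(1, T):
        row, ptr = [], []
        for label in (0, 1):
            # Pay lam * alpha only when the label changes between frames.
            cands = [best[t - 1][p] - lam * (0.0 if p == label else alpha)
                     for p in (0, 1)]
            p_best = 0 if cands[0] >= cands[1] else 1
            row.append(cands[p_best] + unary[t][label])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final label.
    label = 0 if best[-1][0] >= best[-1][1] else 1
    labels = [label]
    for ptr in reversed(back):
        label = ptr[label]
        labels.append(label)
    labels.reverse()
    return labels


def extract_proposals(scores, labels):
    """Return (start, end, mean_score) per run of positive labels; end is inclusive."""
    proposals, start = [], None
    # A trailing 0 closes a run that reaches the last frame.
    for t, label in enumerate(labels + [0]):
        if label == 1 and start is None:
            start = t
        elif label == 0 and start is not None:
            segment = scores[start:t]
            proposals.append((start, t - 1, sum(segment) / len(segment)))
            start = None
    return proposals


# Toy frame scores with a dip at frame 3 (illustrative values only).
frame_scores = [0.1, 0.1, 0.9, 0.4, 0.9, 0.1, 0.1]
labels = label_frames(frame_scores, lam=1.0, alpha=0.3)
proposals = extract_proposals(frame_scores, labels)
print(labels, proposals)
```

In this toy run the switching penalty keeps frame 3 inside the proposal even though its own score is below 0.5, which is the piecewise constant behaviour the pairwise term is meant to enforce. The cost is linear in the number of frames.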

2.3.3 Activity detection

The top activity proposals in each video (two in this case) are assigned the label of the top-scoring class from untrimmed video classification (§2.2). For example, if $c=10$ is the top class for the video, with score $S_{10}^{s}$, and $a$ is the top activity proposal, with score $S_{a}$ (§2.3.2), then a detection of class $10$ is flagged with the temporal bounds determined by activity proposal $a$ and score $S_{10}^{a}=S_{10}^{s}\cdot S_{a}$. Similarly, we can generate more detections for each of the top classes by using the top activity proposals.
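The scoring rule above can be sketched as follows. The function name `make_detections`, the tuple layout `(start, end, score)` for proposals, and the example values are assumptions made for illustration; class indices are zero-based in this sketch.

```python
def make_detections(class_scores, proposals, n_proposals=2):
    """Label the top proposals with the top video class.

    Each detection score is the video-level class score times the proposal score.
    """
    # Index of the highest-scoring class for the whole video.
    top_class = max(range(len(class_scores)), key=lambda c: class_scores[c])
    # Keep the n_proposals proposals with the highest mean frame score.
    ranked = sorted(proposals, key=lambda p: p[2], reverse=True)[:n_proposals]
    return [
        {"label": top_class, "start": start, "end": end,
         "score": class_scores[top_class] * proposal_score}
        for (start, end, proposal_score) in ranked
    ]


# Toy normalised class scores and three proposals (illustrative values only).
class_scores = [0.1, 0.5, 0.25]
video_proposals = [(0, 10, 0.5), (20, 35, 0.75), (40, 45, 0.25)]
detections = make_detections(class_scores, video_proposals, n_proposals=2)
print(detections)
```

Repeating the same call for the second and third ranked classes would give the additional detections mentioned in the text.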

Implementation

We used the precomputed features provided by the competition organisers. We used scikit-learn for the linear SVM and random forest implementations. Our code is available at https://github.com/gurkirt/actNet-inAct.

Results

We report results for untrimmed classification and activity detection on ActivityNet. We use the same evaluation setting as described in the challenge.

2 Activity detection

Conclusion and Future Work

We show that activity detection can be achieved via untrimmed video classification. Our dynamic programming-based approach is efficient, and has shown clear potential for generating good-quality activity proposals.

The approach can easily be extended to simultaneous detection and classification without requiring video-level classification scores, which opens up the opportunity for online activity classification, detection and prediction.