Single Shot Temporal Action Detection

Tianwei Lin, Xu Zhao, Zheng Shou

Introduction

Due to the continuous boom of videos on the internet, video content analysis has attracted wide attention from both industry and academia in recent years. An important branch of video content analysis is action recognition, which usually aims at classifying manually trimmed video clips into categories. Substantial progress has been reported for this task in (Wang and Schmid, 2013; Feichtenhofer et al., 2016; Wang et al., 2015; Qiu et al., 2016; Tran et al., 2015). However, most videos in the real world are untrimmed and may contain multiple action instances with irrelevant background scenes or activities. This problem has motivated the academic community to turn its attention to another challenging task: temporal action detection. This task aims to detect action instances in untrimmed video, including the temporal boundaries and categories of the instances. Methods proposed for this task can be used in many areas such as surveillance video analysis and intelligent home care.

Temporal action detection can be regarded as a temporal version of object detection in images, since both tasks aim to determine the boundaries and categories of multiple instances (actions in time, objects in space). A popular series of models in object detection is R-CNN and its variants (Girshick, 2015; Girshick et al., 2014; Ren et al., 2015), which adopt the "detect by classifying region proposals" framework. Inspired by R-CNN, many recent temporal action detection approaches adopt a similar framework and classify temporal action instances generated by a proposal method (Shou et al., 2016; Caba Heilbron et al., 2016; Yu and Yuan, 2015; Escorcia et al., 2016) or a simple sliding window method (Oneata et al., 2014; Karaman et al., 2014; Wang et al., 2014). This framework may have some major drawbacks: (1) proposal generation and classification are separate procedures that have to be trained separately, whereas ideally we want to train them jointly to obtain an optimal model; (2) the proposal generation or sliding window method costs additional time; (3) the temporal boundaries of action instances generated by the sliding window method are usually approximate rather than precise, and are left fixed during classification. Also, since the scales of the sliding windows are pre-determined, the method cannot flexibly predict instances of various scales.

To address these issues, we propose the Single Shot Action Detector (SSAD) network, a temporal convolutional network that operates on feature sequences with multiple granularities. Inspired by another family of object detection methods, namely single shot detection models such as SSD (Liu et al., 2016) and YOLO (Redmon et al., 2016; Redmon and Farhadi, 2016), our SSAD network skips the proposal generation step and directly predicts temporal boundaries and confidence scores for multiple action categories, as shown in Figure 1. The SSAD network contains three sub-modules: (1) base layers read in the feature sequence and shorten its temporal length; (2) anchor layers output temporal feature maps, which are associated with anchor action instances; (3) prediction layers generate category probabilities, location offsets and overlap scores for these anchor action instances.

To better encode both spatial and temporal information in video, we adopt multiple action recognition models (action classifiers) to extract features at multiple granularities. We concatenate the output category probabilities from all action classifiers at the snippet level to form the Snippet-level Action Score (SAS) feature. Sequences of SAS features are used as the input of the SSAD network.

Note that it is non-trivial to adapt the single shot detection model from object detection to temporal action detection. Firstly, unlike VGGNet (Simonyan and Zisserman, 2015), which is widely used in 2D ConvNet models, there is no widely used pre-trained temporal convolutional network. Thus in this work, we search over multiple network architectures to find the best one. Secondly, we integrate key advantages of different single shot detection models to get the best performance from our SSAD network. On one hand, similar to YOLO9000 (Redmon and Farhadi, 2016), we simultaneously predict the location offsets, category probabilities and overlap score of each anchor action instance. On the other hand, like SSD (Liu et al., 2016), we use anchor instances of multiple scale ratios on feature maps of multiple scales, which allows the network to flexibly handle action instances of various scales. Finally, to further improve performance, we fuse the predicted category probabilities with temporally pooled snippet-level action scores during prediction.

The main contributions of our work are summarized as follows:

(1) To the best of our knowledge, our work is the first Single Shot Action Detector (SSAD) for video, which can effectively predict both the boundaries and the confidence scores of multiple action categories in untrimmed video without a proposal generation step.

(2) In this work, we explore many configurations of the SSAD network, such as input feature types, network architectures and post-processing strategies. Proper configurations are adopted to achieve better performance on the temporal action detection task.

(3) We conduct extensive experiments on two challenging benchmark datasets: THUMOS’14 (Jiang et al., 2014) and MEXaction2 (mex, 2015). With the Intersection-over-Union (IoU) threshold set to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems, increasing mAP from $19.0\%$ to $24.6\%$ on THUMOS’14 and from $7.4\%$ to $11.0\%$ on MEXaction2.

Related Work

Action recognition. Action recognition is an important research topic in video content analysis. Just as image classification networks can be used in image object detection, action recognition models can be used in temporal action detection for feature extraction. We mainly review the following methods, which can be used in temporal action detection. The Improved Dense Trajectory (iDT) (Wang et al., 2011; Wang and Schmid, 2013) feature consists of MBH, HOF and HOG features extracted along dense trajectories. The iDT method uses SIFT and optical flow to eliminate the influence of camera motion. The two-stream network (Feichtenhofer et al., 2016; Simonyan and Zisserman, 2014; Wang et al., 2015) learns spatial and temporal features by operating on a single frame and a stacked optical flow field respectively, using a 2D Convolutional Neural Network (CNN) such as GoogleNet (Szegedy et al., 2015), VGGNet (Simonyan and Zisserman, 2015) or ResNet (He et al., 2016). The C3D network (Tran et al., 2015) uses 3D convolution to capture both spatial and temporal information directly from the raw video frame volume, and is very efficient. Feature encoding methods such as Fisher Vector (Wang and Schmid, 2013) and VAE (Qiu et al., 2016) are widely used in the action recognition task to improve performance. There are also many widely used action recognition benchmarks such as UCF101 (Soomro et al., 2012), HMDB51 (Kuehne et al., 2013) and Sports-1M (Karpathy et al., 2014).

Temporal action detection. This task focuses on learning how to detect action instances in untrimmed videos where the boundaries and categories of action instances have been annotated. Typical datasets such as THUMOS 2014 (Jiang et al., 2014) and MEXaction2 (mex, 2015) include a large number of untrimmed videos with multiple action categories and complex background information.

Recently, many approaches have adopted the "detection by classification" framework. For example, many approaches (Oneata et al., 2014; Karaman et al., 2014; Singh and Cuzzolin, 2016; Wang et al., 2014; Wang and Tao, 2016) use extracted features such as the iDT feature to train SVM classifiers, and then classify segment proposals or sliding windows with these SVM classifiers. There are also approaches proposed specifically for temporal action proposal generation (Caba Heilbron et al., 2016; Yu and Yuan, 2015; Escorcia et al., 2016; Gemert et al., 2015; Mettes et al., 2016). Our SSAD network differs from these methods mainly in that it contains no proposal generation step.

Recurrent Neural Networks (RNNs) are widely used in action detection approaches (Yeung et al., 2016; Yuan et al., 2016; Ma et al., 2016; Singh et al., 2016) to encode the feature sequence and make per-frame predictions of action categories. However, it is difficult for RNNs to keep a memory over a long time period in practice (Singh et al., 2016). An alternative choice is temporal convolution. For example, Lea et al. (Lea et al., 2016) propose Temporal Convolutional Networks (TCN) for temporal action segmentation. We also adopt temporal convolutional layers, which enable our SSAD network to handle action instances with a much longer time period.

Object detection. Deep learning approaches have shown strong performance in object detection. We review the two main sets of object detection methods proposed in recent years. The representative methods in the first set are R-CNN (Girshick et al., 2014) and its variations (Girshick, 2015; Ren et al., 2015). R-CNN uses selective search to generate multiple region proposals and then applies a CNN to each proposal separately to classify its category. Fast R-CNN (Girshick, 2015) uses a 2D RoI pooling layer, which lets proposals share the feature map and reduces time consumption. Faster R-CNN (Ren et al., 2015) adopts an RPN network instead of selective search to generate region proposals.

The other set of object detection methods are single shot detection methods, which detect objects directly without generating proposals. There are two well-known models. YOLO (Redmon et al., 2016; Redmon and Farhadi, 2016) uses the whole topmost feature map to predict the probabilities of multiple categories and the corresponding confidence scores and location offsets. SSD (Liu et al., 2016) makes predictions from multiple feature maps with default boxes of multiple scales. In our work, we combine the characteristics of these single shot detection methods and embed them into the proposed SSAD network.

Our Approach

In this section, we introduce our approach in detail. The framework of our approach is shown in Figure 2.

We denote a video as $X_{v}=\left\{x_{t}\right\}_{t=1}^{T_{v}}$ where $T_{v}$ is the number of frames in $X_{v}$ and $x_{t}$ is the $t$ -th frame in $X_{v}$ . Each untrimmed video $X_{v}$ is annotated with a set of temporal action instances $\Phi_{v}=\left\{\phi_{n}=\left(\varphi_{n},\varphi^{\prime}_{n},k_{n}\right)\right\}_{n=1}^{N_{v}}$ , where $N_{v}$ is the number of temporal action instances in $X_{v}$ , and $\varphi_{n},\varphi^{\prime}_{n},k_{n}$ are the starting time, ending time and category of action instance $\phi_{n}$ respectively. $k_{n}\in\left\{1,\ldots,K\right\}$ where $K$ is the number of action categories. $\Phi_{v}$ is given during training and needs to be predicted during prediction.

2. Extraction of Snippet-level Action Scores

To apply the SSAD model, we first need to perform snippet-level action classification and obtain Snippet-level Action Score (SAS) features. Given a video $X_{v}$ , a snippet $s_{t}=\left(x_{t},F_{t},X_{t}\right)$ is composed of three parts: $x_{t}$ is the $t$ -th frame in $X_{v}$ , $F_{t}=\left\{f_{t^{\prime}}\right\}_{t^{\prime}=t-4}^{t+5}$ is the stacked optical flow field derived around $x_{t}$ and $X_{t}=\left\{x_{t^{\prime}}\right\}_{t^{\prime}=t-7}^{t+8}$ is the video frame volume. So given a video $X_{v}$ , we can get a sequence of snippets $S_{v}=\left\{s_{t}\right\}_{t=1}^{T_{v}}$ . We pad the head and tail of the video $X_{v}$ with its first and last frame respectively, so that $S_{v}$ has the same length as $X_{v}$ .
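The index ranges of a snippet can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `clamp` and `snippet_indices` are hypothetical, and clamping out-of-range indices (including those of the optical flow fields) is assumed to be equivalent to the head and tail padding described above.

```python
def clamp(i, lo, hi):
    """Clamp index i into [lo, hi]; this replicates the first/last frame."""
    return max(lo, min(i, hi))


def snippet_indices(t, num_frames):
    """Return (frame index, flow indices, volume indices) for snippet s_t.

    Frames are 1-indexed as in the text. Indices that fall outside the
    video are mapped to the first or last frame (assumed padding scheme).
    """
    flow = [clamp(i, 1, num_frames) for i in range(t - 4, t + 6)]    # t-4 .. t+5
    volume = [clamp(i, 1, num_frames) for i in range(t - 7, t + 9)]  # t-7 .. t+8
    return t, flow, volume


frame, flow, volume = snippet_indices(1, 100)
print(len(flow), len(volume))  # 10 flow fields and 16 frames per snippet
```

The flow stack covers 10 time steps and the frame volume covers 16, which matches the clip length of 16 used by C3D.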

Action classifier. To evaluate the category probabilities of each snippet, we use multiple action classifiers with strong performance on the action recognition task: the two-stream network (Simonyan and Zisserman, 2014) and the C3D network (Tran et al., 2015). The two-stream network includes spatial and temporal networks, which operate on a single video frame $x_{t}$ and a stacked optical flow field $F_{t}$ respectively. We use the same two-stream network architecture as described in (Wang et al., 2015), which adopts the VGGNet-16 network architecture. The C3D network, proposed in (Tran et al., 2015), includes multiple 3D convolution layers and 3D pooling layers. It operates on a short video frame volume $X_{t}$ with length $l$ , where $l$ is the length of the video clip and is set to 16 in C3D. So there are three individual action classifiers in total: the spatial network measures spatial information, the temporal network measures temporal consistency and the C3D network measures both. In section 4.3, we evaluate the effect of each action classifier and of their combinations.

SAS feature. As shown in Figure 2(a), given a snippet $s_{t}$ , each action classifier generates a score vector $\bm{p_{t}}$ with length $K^{\prime}=K+1$ , covering $K$ action categories and one background category. Then we concatenate the output scores of all classifiers to form the Snippet-level Action Score (SAS) feature $\bm{p_{sas,t}}=\left(\bm{p_{S,t}},\bm{p_{T,t}},\bm{p_{C,t}}\right)$ , where $\bm{p_{S,t}}$ , $\bm{p_{T,t}}$ , $\bm{p_{C,t}}$ are the output scores of the spatial, temporal and C3D networks respectively. So given a snippet sequence $S_{v}$ with length $T_{v}$ , we can extract an SAS feature sequence $P_{v}=\left\{\bm{p_{sas,t}}\right\}_{t=1}^{T_{v}}$ . Since the number of frames in a video varies and may be very large, we use a large observation window with length $T_{w}$ to truncate the feature sequence. We denote a window as $\omega=\left\{\varphi_{\omega},\varphi^{\prime}_{\omega},P_{\omega},\Phi_{\omega}\right\}$ , where $\varphi_{\omega}$ and $\varphi^{\prime}_{\omega}$ are the starting and ending time of $\omega$ , and $P_{\omega}$ and $\Phi_{\omega}$ are the SAS feature sequence and the corresponding ground truth action instances respectively.
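The concatenation and truncation steps can be illustrated with a short sketch. The function names `sas_feature` and `truncate_window` are hypothetical, and plain Python lists stand in for whatever tensor type an actual implementation would use.

```python
def sas_feature(p_spatial, p_temporal, p_c3d):
    """Concatenate the three per-classifier score vectors into one SAS feature."""
    if not (len(p_spatial) == len(p_temporal) == len(p_c3d)):
        raise ValueError("every classifier must output K' scores")
    return list(p_spatial) + list(p_temporal) + list(p_c3d)


def truncate_window(sas_sequence, start, window_length):
    """Cut the observation window [start, start + window_length) from a sequence."""
    return sas_sequence[start:start + window_length]


# With K' = 3 scores per classifier, the SAS feature has 3 * K' = 9 dimensions.
feature = sas_feature([0.1, 0.2, 0.7], [0.3, 0.3, 0.4], [0.5, 0.25, 0.25])
print(len(feature))
```

The feature dimension is always three times the per-classifier output length, since three classifiers are concatenated.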

3. SSAD Network

Temporal action detection is quite different from object detection in 2D images. In SSAD we adopt two main characteristics from single shot object detection models such as SSD (Liu et al., 2016) and YOLO (Redmon et al., 2016; Redmon and Farhadi, 2016): 1) unlike "detection by classification" approaches, SSAD directly predicts the categories and location offsets of action instances in untrimmed video using convolutional prediction layers; 2) SSAD combines temporal feature maps from different convolution layers for prediction, making it possible to handle action instances of various lengths. We first introduce the network architecture.

Network architecture. The architecture of the SSAD network is presented in Figure 2(b). It mainly contains three sub-modules: base layers, anchor layers and prediction layers. Base layers handle the input SAS feature sequence, and use both convolution and pooling layers to shorten the temporal length of the feature map and increase the size of the receptive fields. Then anchor layers use temporal convolution to further shorten the feature map and output anchor feature maps for action instance prediction. Each cell of the anchor layers is associated with anchor instances of multiple scales. Finally, we use prediction layers to get the classification score, overlap score and location offsets of each anchor instance.

In the SSAD network, we adopt 1D temporal convolution and pooling to capture temporal information. We apply the Rectified Linear Unit (ReLU) activation function (Glorot et al., 2011) to the output temporal feature maps of all layers except the convolutional prediction layers. We adopt temporal max pooling since max pooling enhances invariance to small changes in the input.

Base layers. Since there are no widely used pre-trained 1D ConvNet models comparable to the VGGNet (Simonyan and Zisserman, 2015) used in 2D ConvNet models, we search over many different network architectures for the SSAD network. These architectures differ only in the base layers, while the architecture of the anchor layers and prediction layers is kept the same. As shown in Figure 3, we design five architectures of base layers in total. In these architectures, we mainly explore three aspects: 1) whether to use a convolution or a pooling layer to shorten the temporal dimension and increase the size of the receptive fields; 2) the number of layers in the network; and 3) the kernel size of the convolution layers. Note that we set the number of convolutional filters in all base layers to 256. Evaluation results for these architectures are shown in section 4.3, and we finally adopt architecture $B$ , which achieves the best performance.

Multi-scale anchor layers. After processing the SAS feature sequence with the base layers, we stack three anchor convolutional layers (Conv-A1, Conv-A2 and Conv-A3) on top of them. These layers have the same configuration: kernel size 3, stride 2 and 512 convolutional filters. The output anchor feature maps of the anchor layers are $f_{A1}$ , $f_{A2}$ and $f_{A3}$ with size $(T_{w}/32\times 512)$ , $(T_{w}/64\times 512)$ and $(T_{w}/128\times 512)$ respectively. Multiple anchor layers progressively decrease the temporal dimension of the feature map and allow SSAD to make predictions from feature maps of multiple resolutions.
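The temporal lengths of the three anchor feature maps follow from simple arithmetic, sketched below. The function name `anchor_map_lengths` is hypothetical. The base-layer downsampling factor of 16 is inferred from the stated size $T_{w}/32$ after one stride-2 layer, and the sketch assumes padding is chosen so that each stride-2 convolution halves the length.

```python
import math


def anchor_map_lengths(window_length, base_downsample=16,
                       num_anchor_layers=3, stride=2):
    """Temporal length of each anchor feature map (Conv-A1, Conv-A2, Conv-A3).

    base_downsample=16 is an inference from the T_w/32 size quoted in the
    text, not a value stated explicitly there.
    """
    length = window_length // base_downsample
    lengths = []
    for _ in range(num_anchor_layers):
        # Assumes 'same'-style padding, so only the stride changes the length.
        length = math.ceil(length / stride)
        lengths.append(length)
    return lengths


print(anchor_map_lengths(512))  # lengths for T_w = 512
```

For $T_{w}=512$ this gives maps of length 16, 8 and 4, that is $T_{w}/32$ , $T_{w}/64$ and $T_{w}/128$ .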

For each temporal feature map of anchor layers, we associate a set of multiple scale anchor action instances with each feature map cell as shown in Figure 4. For each anchor instance, we use convolutional prediction layers to predict overlap score, classification score and location offsets, which will be introduced later.

As for the details of the multi-scale anchor instances, the lower anchor feature map has higher resolution and a smaller receptive field than the top anchor feature map. So we let the lower anchor layers detect short action instances and the top anchor layers detect long action instances. For a temporal feature map $f$ of an anchor layer with length $M_{f}$ , we define the base scale $s_{f}=\frac{1}{M_{f}}$ and a set of scale ratios $R_{f}=\left\{r_{d}\right\}_{d=1}^{D_{f}}$ , where $D_{f}$ is the number of scale ratios. We use $\{1,1.5,2\}$ for $f_{A1}$ and $\{0.5,0.75,1,1.5,2\}$ for $f_{A2}$ and $f_{A3}$ . For each ratio $r_{d}$ , we calculate $\mu_{w}=s_{f}\cdot r_{d}$ as the anchor instance’s default width. All anchor instances associated with the $m$ -th feature map cell share the same default center location $\mu_{c}=\frac{m+0.5}{M_{f}}$ . So for an anchor feature map $f$ with length $M_{f}$ and $D_{f}$ scale ratios, the number of associated anchor instances is $M_{f}\cdot D_{f}$ .
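The default anchor instances defined above can be enumerated with a few lines of code. This is a sketch under the assumption that the cell index $m$ runs from 0 to $M_{f}-1$ , which the form of $\mu_{c}$ suggests but the text does not state; the function name `anchor_instances` is hypothetical.

```python
def anchor_instances(map_length, scale_ratios):
    """List the (default center, default width) of every anchor on one feature map.

    Centers and widths are normalized to the window length, so they lie in [0, 1]
    for ratios up to the map length. Cell indices are assumed to start at 0.
    """
    base_scale = 1.0 / map_length
    anchors = []
    for m in range(map_length):
        center = (m + 0.5) / map_length
        for ratio in scale_ratios:
            anchors.append((center, base_scale * ratio))
    return anchors


# Anchor layers for a window of 512 snippets: map lengths 16, 8 and 4.
layers = [(16, [1, 1.5, 2]),
          (8, [0.5, 0.75, 1, 1.5, 2]),
          (4, [0.5, 0.75, 1, 1.5, 2])]
total = sum(len(anchor_instances(m, r)) for m, r in layers)
print(total)  # total number of anchor instances per window
```

With these settings a window yields $16\cdot 3+8\cdot 5+4\cdot 5=108$ anchor instances.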

Prediction layers. We use a set of convolutional filters to predict the classification scores, overlap scores and location offsets of the anchor instances associated with each feature map cell. As shown in Figure 4, for an anchor feature map $f$ with length $M_{f}$ and $D_{f}$ scale ratios, we use $D_{f}\cdot(K^{\prime}+3)$ temporal convolutional filters with kernel size 3 and stride 1 for prediction. The output of the prediction layer has size $\left(M_{f}\times\left(D_{f}\cdot(K^{\prime}+3)\right)\right)$ and can be reshaped into $\left(\left(M_{f}\cdot D_{f}\right)\times(K^{\prime}+3)\right)$ . Each anchor instance gets a prediction score vector $\bm{p_{pred}}=\left(\bm{p_{class}},p_{over},\Delta c,\Delta w\right)$ with length $(K^{\prime}+3)$ , where $\bm{p_{class}}$ is the classification score vector with length $K^{\prime}$ , $p_{over}$ is the overlap score and $\Delta c$ , $\Delta w$ are the location offsets. The classification score $\bm{p_{class}}$ is used to predict the anchor instance’s category. The overlap score $p_{over}$ is used to estimate the overlap between the anchor instance and the ground truth instances and should take a value in $[0,1]$ , so it is normalized using the sigmoid function:

$$p^{\prime}_{over}=\mathrm{sigmoid}(p_{over}).$$

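The location offsets $\Delta c$ , $\Delta w$ are used for adjusting the default location of the anchor instance. The adjusted location is defined as:

$$\varphi_{c}=\mu_{c}+\alpha_{1}\cdot\mu_{w}\cdot\Delta c,\qquad\varphi_{w}=\mu_{w}\cdot\exp(\alpha_{2}\cdot\Delta w),$$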

where $\varphi_{c}$ and $\varphi_{w}$ are the center location and width of the anchor instance respectively. $\alpha_{1}$ and $\alpha_{2}$ control the effect of the location offsets to make prediction stable. We set both $\alpha_{1}$ and $\alpha_{2}$ to 0.1. The starting and ending time of the action instance are $\varphi=\varphi_{c}-\frac{1}{2}\cdot\varphi_{w}$ and $\varphi^{\prime}=\varphi_{c}+\frac{1}{2}\cdot\varphi_{w}$ respectively. So for an anchor feature map $f$ , we can get an anchor instance set $\Phi_{f}=\left\{\phi_{n}=\left(\varphi_{c},\varphi_{w},p_{class},p^{\prime}_{over}\right)\right\}_{n=1}^{N_{f}}$ , where $N_{f}=M_{f}\cdot D_{f}$ is the number of anchor instances. The total prediction instance set is $\Phi_{p}=\left\{\Phi_{f_{A1}},\Phi_{f_{A2}},\Phi_{f_{A3}}\right\}$ .
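Decoding one anchor prediction into a start time, an end time and a normalized overlap score might look like the sketch below. The function name `decode_anchor` is hypothetical. The decoding form is an assumption in the style of SSD: the center is shifted by $\alpha_{1}\cdot\mu_{w}\cdot\Delta c$ and the width is scaled by $\exp(\alpha_{2}\cdot\Delta w)$ . Only the values $\alpha_{1}=\alpha_{2}=0.1$ and the start and end formulas are taken directly from the text.

```python
import math

ALPHA_1 = 0.1  # weight of the center offset, as stated in the text
ALPHA_2 = 0.1  # weight of the width offset, as stated in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_anchor(mu_c, mu_w, p_over, delta_c, delta_w):
    """Turn raw network outputs for one anchor into (start, end, overlap).

    mu_c, mu_w: default center and width of the anchor.
    p_over, delta_c, delta_w: raw outputs of the prediction layer.
    The center/width update rule below is an assumed SSD-style form.
    """
    overlap = sigmoid(p_over)
    center = mu_c + ALPHA_1 * mu_w * delta_c
    width = mu_w * math.exp(ALPHA_2 * delta_w)
    return center - 0.5 * width, center + 0.5 * width, overlap


# Zero offsets leave the default anchor unchanged.
print(decode_anchor(0.5, 0.25, 0.0, 0.0, 0.0))
```

Using an exponential for the width keeps the decoded width positive for any value of $\Delta w$ , which is one reason this form is common in anchor-based detectors.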

4. Training of SSAD network

Training data construction. As described in Section 3.2, for an untrimmed video $X_{v}$ with length $T_{v}$ , we get an SAS feature sequence $P_{v}$ of the same length. Then we slide a window of length $T_{w}$ over the feature sequence with $75\%$ overlap. The overlap between sliding windows handles the situation where action instances lie on the boundary of a window, and it also increases the amount of training data. During training, we only keep windows containing at least one ground-truth instance. So given a set of untrimmed training videos, we get a training set $\Omega=\left\{\omega_{n}\right\}_{n=1}^{N_{\omega}}$ , where $N_{\omega}$ is the number of windows. We randomly shuffle the order of the training set to make the network converge faster, and the same random seed is used during evaluation.
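The window construction can be sketched as follows. The helper names `window_starts` and `has_ground_truth` are hypothetical, and two details are assumptions rather than statements from the text: a final window is appended so the tail of the sequence is covered, and a window counts as "containing" an instance if the two overlap at all.

```python
def window_starts(sequence_length, window_length=512, overlap=0.75):
    """Start indices of sliding windows with the given fractional overlap."""
    stride = int(round(window_length * (1.0 - overlap)))
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    if sequence_length <= window_length:
        return [0]  # a short sequence yields a single (padded) window
    starts = list(range(0, sequence_length - window_length + 1, stride))
    if starts[-1] + window_length < sequence_length:
        # Assumption: add one last window so the tail is not dropped.
        starts.append(sequence_length - window_length)
    return starts


def has_ground_truth(start, window_length, instances):
    """True if any (start, end) instance overlaps the window (assumed criterion)."""
    end = start + window_length
    return any(s < end and e > start for s, e in instances)


starts = window_starts(1024)
print(starts)  # stride is 128 snippets for 75% overlap of a 512-snippet window
kept = [s for s in starts if has_ground_truth(s, 512, [(600, 700)])]
print(kept)
```

The same `window_starts` helper covers the prediction setting by passing `overlap=0.25`.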

Label assignment. During training, given a window $\omega$ , we can get the prediction instance set $\Phi_{p}$ via the SSAD network. We need to match these instances with the ground truth set $\Phi_{\omega}$ for label assignment. For an anchor instance $\phi_{n}$ in $\Phi_{p}$ , we calculate its IoU overlap with all ground truth instances in $\Phi_{\omega}$ . If the highest IoU overlap is higher than 0.5, we match $\phi_{n}$ with the corresponding ground truth instance $\phi_{g}$ and regard it as positive, otherwise as negative. We expand $\phi_{n}$ with the matching information as $\phi^{\prime}_{n}=\left(\varphi_{c},\varphi_{w},\bm{p_{class}},p^{\prime}_{over},k_{g},g_{iou},g_{c},g_{w}\right)$ , where $k_{g}$ is the category of $\phi_{g}$ and is set to 0 for a negative instance, $g_{iou}$ is the IoU overlap between $\phi_{n}$ and $\phi_{g}$ , and $g_{c}$ and $g_{w}$ are the center location and width of $\phi_{g}$ respectively. So a ground truth instance can match multiple anchor instances, while an anchor instance can match at most one ground truth instance.
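The matching rule can be written down compactly. The sketch below uses hypothetical names (`temporal_iou`, `assign_label`) and represents instances as `(start, end)` tuples, with ground truth carrying an extra category in $1,\ldots,K$ .

```python
def temporal_iou(a, b):
    """Intersection-over-Union of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def assign_label(anchor, ground_truths, threshold=0.5):
    """Match one anchor to its best ground truth instance.

    anchor: (start, end); ground_truths: list of (start, end, category).
    Returns (k_g, g_iou, matched_gt), with k_g = 0 and matched_gt = None
    for a negative anchor.
    """
    best_iou, best_gt = 0.0, None
    for gt in ground_truths:
        iou = temporal_iou(anchor, gt[:2])
        if iou > best_iou:
            best_iou, best_gt = iou, gt
    if best_gt is not None and best_iou > threshold:
        return best_gt[2], best_iou, best_gt
    return 0, best_iou, None


print(assign_label((0.0, 1.0), [(0.0, 0.8, 3)]))   # positive, category 3
print(assign_label((0.0, 1.0), [(0.9, 2.0, 5)]))   # negative, category 0
```

Because each anchor keeps only its best match, one ground truth instance can be matched by several anchors while an anchor is matched to at most one instance, as described above.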

Hard negative mining. During label assignment, only a small fraction of the anchor instances match ground truth instances, causing an imbalance between positive and negative instances. Thus we adopt a hard negative mining strategy to reduce the number of negative instances. Here, hard negative instances are defined as negative instances with an overlap score larger than 0.5. We take all hard negative instances and randomly sample negative instances from the remaining ones so that the ratio between positive and negative instances is close to 1:1. This ratio is chosen by empirical validation. So after label assignment and hard negative mining, we get $\Phi^{\prime}_{p}=\left\{\phi^{\prime}_{n}\right\}_{n=1}^{N_{train}}$ as the input set during training, where $N_{train}$ is the total number of training instances, that is, the sum of the number of positives $N_{pos}$ and the number of negatives $N_{neg}$ .
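A possible reading of this sampling rule is sketched below. The function name `select_negatives` is hypothetical, and the sketch assumes that the "overlap score" of a negative is its predicted score $p^{\prime}_{over}$ and that all hard negatives are kept even when they outnumber the positives.

```python
import random


def select_negatives(negatives, num_positives, hard_threshold=0.5, rng=None):
    """Pick negatives for training: all hard ones plus random easy ones.

    negatives: list of (anchor_id, predicted_overlap_score).
    Easy negatives are sampled until the negative count reaches roughly
    the positive count (1:1 ratio).
    """
    rng = rng or random.Random(0)
    hard = [n for n in negatives if n[1] > hard_threshold]
    easy = [n for n in negatives if n[1] <= hard_threshold]
    num_extra = max(0, num_positives - len(hard))
    sampled = rng.sample(easy, min(num_extra, len(easy)))
    return hard + sampled


negatives = [(0, 0.9), (1, 0.1), (2, 0.2), (3, 0.7), (4, 0.3)]
chosen = select_negatives(negatives, num_positives=3)
print(len(chosen))  # two hard negatives plus one sampled easy negative
```

Keeping every hard negative focuses training on anchors whose overlap the network currently overestimates.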

Objective for training. The training objective of the SSAD network is to solve a multi-task optimization problem. The overall loss function is a weighted sum of the classification loss (class), the overlap loss (over), the detection loss (loc) and the L2 loss for regularization:

$$L=L_{class}+\alpha\cdot L_{over}+\beta\cdot L_{loc}+\lambda\cdot L_{2}(\Theta),$$

where $\alpha$ , $\beta$ and $\lambda$ are the weight terms used for balancing each part of the loss function. Both $\alpha$ and $\beta$ are set to 10 and $\lambda$ is set to 0.0001 by empirical validation. For the classification loss, we use the conventional softmax loss over multiple categories, which is effective for training classification models and can be defined as:

$$L_{class}=L_{softmax}=\frac{1}{N_{train}}\sum_{i=1}^{N_{train}}\left(-\log P_{i}^{(k_{g})}\right),$$

where $P_{i}^{(k_{g})}=\frac{\exp(p_{class,i}^{(k_{g})})}{\sum_{j}\exp(p_{class,i}^{(k_{j})})}$ and $k_{g}$ is the label of this instance.

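$L_{over}$ is used to make a precise prediction of the anchor instances’ IoU overlap score, which helps the NMS procedure. The overlap loss adopts the mean square error (MSE) loss and is defined as:

$$L_{over}=\frac{1}{N_{train}}\sum_{i=1}^{N_{train}}\left(p^{\prime}_{over,i}-g_{iou,i}\right)^{2}.$$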

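$L_{loc}$ is the Smooth L1 loss (Girshick, 2015) for location offsets. We regress the center ( $\varphi_{c}$ ) and width ( $\varphi_{w}$ ) of the predicted instance:

$$L_{loc}=\frac{1}{N_{pos}}\sum_{i=1}^{N_{pos}}\left(SL_{1}(\varphi_{c,i}-g_{c,i})+SL_{1}(\varphi_{w,i}-g_{w,i})\right),$$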

where $g_{c,i}$ and $g_{w,i}$ are the center location and width of the ground truth instance. $L_{2}(\Theta)$ is the L2 regularization loss, where $\Theta$ stands for the parameters of the whole SSAD network.
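The data-dependent part of this objective can be sketched in plain Python. Several choices here are assumptions made for illustration, not details confirmed by the text: the classification and overlap terms are averaged over all $N_{train}$ instances, the location term is averaged over the $N_{pos}$ positives only, $\alpha$ weights the overlap term and $\beta$ the location term, and the L2 regularization term is omitted because it depends on the network weights. The names `smooth_l1`, `softmax_nll` and `ssad_loss` are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1 loss of a scalar residual."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_nll(scores, label):
    """Negative log softmax probability of `label`, computed stably."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def ssad_loss(instances, alpha=10.0, beta=10.0):
    """Classification + alpha * overlap + beta * location loss (no L2 term).

    Each instance is a dict with keys p_class (list of K' raw scores),
    p_over (after sigmoid), k_g (0 for negatives), g_iou, and for positives
    also center, width, g_c, g_w.
    """
    if not instances:
        raise ValueError("need at least one training instance")
    n_train = len(instances)
    positives = [x for x in instances if x["k_g"] > 0]
    l_class = sum(softmax_nll(x["p_class"], x["k_g"]) for x in instances) / n_train
    l_over = sum((x["p_over"] - x["g_iou"]) ** 2 for x in instances) / n_train
    l_loc = sum(smooth_l1(x["center"] - x["g_c"]) + smooth_l1(x["width"] - x["g_w"])
                for x in positives) / max(1, len(positives))
    return l_class + alpha * l_over + beta * l_loc


positive = {"p_class": [0.0, 0.0], "p_over": 0.5, "k_g": 1, "g_iou": 0.5,
            "center": 0.5, "width": 0.2, "g_c": 0.5, "g_w": 0.2}
print(ssad_loss([positive]))  # only the classification term is non-zero here
```

In this example the overlap and location predictions are exact, so the loss reduces to the softmax term for two equally scored classes.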

5. Prediction and post-processing

During prediction, we prepare the test data in the same way as the training data, with the following two changes: (1) the overlap ratio between windows is reduced to $25\%$ to increase prediction speed and reduce redundant predictions; (2) instead of removing windows without annotation, we keep all windows during prediction, because removing them would leak annotation information. If the input video is shorter than $T_{w}$ , we pad the SAS feature sequence to $T_{w}$ so that there is at least one window for prediction. Given a video $X_{v}$ , we can get a set of windows $\Omega=\left\{\omega_{n}\right\}_{n=1}^{N_{\omega}}$ . Then we use the SSAD network to get the prediction anchors of each window and merge these predictions as $\Phi_{p}=\left\{\phi_{n}\right\}_{n=1}^{N_{p}}$ , where ${N_{p}}$ is the number of prediction instances. For a prediction anchor instance $\phi_{n}$ in $\Phi_{p}$ , we calculate the mean Snippet-level Action Score $\bm{\bar{p}}_{sas}$ over the temporal range of the instance and over the multiple action classifiers:

$$\bm{\bar{p}}_{sas}=\frac{1}{3\cdot(\varphi^{\prime}-\varphi)}\sum_{t=\varphi}^{\varphi^{\prime}}\left(\bm{p_{S,t}}+\bm{p_{T,t}}+\bm{p_{C,t}}\right),$$

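where $\varphi$ and $\varphi^{\prime}$ are the starting and ending time of the prediction anchor instance respectively. Then we fuse the category scores $\bm{\bar{p}}_{sas}$ and $\bm{p}_{class}$ , using the overlap score $p^{\prime}_{over}$ as a multiplication factor, and get $\bm{p_{final}}$ :

$$\bm{p_{final}}=p^{\prime}_{over}\cdot\left(\bm{p_{class}}+\bm{\bar{p}}_{sas}\right).$$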

We choose the maximum dimension $k_{p}$ in $\bm{p}_{final}$ as the category of $\phi_{n}$ and the corresponding score $p_{conf}$ as the confidence score. We expand $\phi_{n}$ as $\phi^{\prime}_{n}=\left\{\varphi_{c},\varphi_{w},p_{conf},k_{p}\right\}$ and get the prediction set $\Phi^{\prime}_{p}=\left\{\phi^{\prime}_{n}\right\}_{n=1}^{N_{p}}$ . Then we apply non-maximum suppression (NMS) to these prediction results, using the confidence score $p_{conf}$ to remove redundant predictions, and get the final prediction instance set $\Phi^{\prime\prime}_{p}=\left\{\phi^{\prime}_{n}\right\}_{n=1}^{N_{p^{\prime}}}$ , where $N_{p^{\prime}}$ is the number of final prediction anchors. Since there is little overlap between action instances of the same category in the temporal action detection task, we take a strict threshold in NMS, which is set to 0.1 by empirical validation.
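The post-processing steps can be sketched as follows. This is an illustration under stated assumptions: the fusion is taken to be the overlap score times the element-wise sum of the classification scores and the mean SAS scores, the two score vectors are assumed to have the same length, and NMS is applied greedily within each category. The names `fuse_scores`, `temporal_iou` and `temporal_nms` are hypothetical.

```python
def fuse_scores(p_class, mean_sas, overlap):
    """Fuse scores as overlap * (p_class + mean_sas), element-wise (assumed form)."""
    return [overlap * (c + s) for c, s in zip(p_class, mean_sas)]


def temporal_iou(a, b):
    """Intersection-over-Union of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(predictions, threshold=0.1):
    """Greedy NMS over (start, end, score, category) tuples.

    A prediction is dropped if a higher-scoring kept prediction of the same
    category overlaps it with IoU above `threshold` (per-category NMS is
    an assumption of this sketch).
    """
    kept = []
    for pred in sorted(predictions, key=lambda p: p[2], reverse=True):
        if all(k[3] != pred[3] or temporal_iou(k[:2], pred[:2]) <= threshold
               for k in kept):
            kept.append(pred)
    return kept


preds = [(0, 10, 0.9, 1), (1, 11, 0.8, 1), (20, 30, 0.7, 1), (0, 10, 0.6, 2)]
for p in temporal_nms(preds):
    print(p)
```

With the strict threshold of 0.1, the second prediction is suppressed because it overlaps the first one heavily, while the disjoint segment and the prediction of a different category are kept.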

Experiments

THUMOS 2014 (Jiang et al., 2014). The temporal action detection task of the THUMOS 2014 dataset is challenging and widely used. The training set is the UCF-101 (Soomro et al., 2012) dataset, which includes 13320 trimmed videos of 101 categories. The validation and test sets contain 1010 and 1574 untrimmed videos respectively. In the temporal action detection task, only 20 action categories are involved and annotated temporally. We only use the 200 validation set videos (including 3007 action instances) and the 213 test set videos (including 3358 action instances) that have temporal annotation to train and evaluate the SSAD network.

MEXaction2 (mex, 2015). There are two action categories in the MEXaction2 dataset: "HorseRiding" and "BullChargeCape". This dataset consists of three subsets: YouTube clips, UCF101 Horse Riding clips and INA videos. The YouTube and UCF101 Horse Riding clips are trimmed and used for training, whereas the INA videos are untrimmed, with approximately 77 hours in total, and are divided into training, validation and testing sets. Regarding temporally annotated action instances, there are 1336 instances in the training set, 310 instances in the validation set and 329 instances in the testing set.

Evaluation metrics. For both datasets, we follow the conventional metrics used in THUMOS’14, which evaluate Average Precision (AP) for each action category and calculate mean Average Precision (mAP) for evaluation. A prediction instance is correct if it has the same category as a ground truth instance and its temporal IoU with this ground truth instance is larger than the IoU threshold $\theta$ . Various IoU thresholds are used during evaluation. Furthermore, redundant detections of the same ground truth instance are not allowed.
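A simplified version of this metric for a single category is sketched below. It is not the official THUMOS’14 evaluation toolkit: it uses non-interpolated AP, treats all detections as coming from one video, and matches each detection to the ground truth instance with the highest IoU. The names `temporal_iou` and `average_precision` are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection-over-Union of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_threshold=0.5):
    """Non-interpolated AP for one category.

    detections: list of (start, end, score); ground_truths: list of (start, end).
    A second detection of an already matched ground truth is a false positive.
    """
    if not ground_truths:
        return 0.0
    matched = set()
    true_positives, ap = 0, 0.0
    ranked = sorted(detections, key=lambda d: d[2], reverse=True)
    for rank, (start, end, _) in enumerate(ranked, 1):
        ious = [temporal_iou((start, end), gt) for gt in ground_truths]
        best = max(range(len(ious)), key=lambda i: ious[i])
        if ious[best] > iou_threshold and best not in matched:
            matched.add(best)
            true_positives += 1
            ap += true_positives / rank  # precision at this recall point
    return ap / len(ground_truths)


gts = [(0, 10), (20, 30)]
dets = [(0, 10, 0.9), (0, 9, 0.8), (20, 30, 0.7)]
print(average_precision(dets, gts))
```

In the example the second detection duplicates the first ground truth instance and counts as a false positive, so the AP is $(1/1+2/3)/2$ . The mAP is then the mean of the per-category AP values.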

2. Implementation Details

Action classifiers. To extract SAS features, the action classifiers must be trained first, including the two-stream networks (Wang et al., 2015) and the C3D network (Tran et al., 2015). We implement both networks based on Caffe (Jia et al., 2014). For both the MEXaction2 and THUMOS’14 datasets, we use the trimmed videos in the training set to train the action classifiers.

For the spatial and temporal networks, we follow the training strategy described in (Wang et al., 2015), which uses VGGNet-16 pre-trained on ImageNet (Deng et al., 2009) to initialize the network and fine-tunes it on the training set. We follow (Tran et al., 2015) to train the C3D network, which is pre-trained on Sports-1M (Karpathy et al., 2014) and then fine-tuned on the training set.

SSAD optimization. To train the SSAD network, we use the adaptive moment estimation (Adam) algorithm (Kingma and Ba, 2014) with the aforementioned multi-task loss function. Our implementation is based on TensorFlow (Abadi et al., 2016). We adopt the Xavier method (Glorot and Bengio, 2010) to randomly initialize the parameters of the whole SSAD network, because there is no suitable pre-trained temporal convolutional network. Even so, the SSAD network can be trained easily and converges quickly, since it has a small number of parameters (20 MB in total) and its input, the SAS feature sequence, is a concise high-level feature. The training procedure takes nearly 1 hour on the THUMOS’14 dataset.

3. Comparison with state-of-the-art systems

Results on THUMOS 2014. To train the action classifiers, we use the full UCF-101 dataset. Instead of using one background category, here we form the background categories from the 81 action categories that are un-annotated in the detection task. Using the two-stream and C3D networks as action classifiers, the dimension of the SAS features is 303.

To train the SSAD model, we use the 200 annotated untrimmed videos in the THUMOS’14 validation set as the training set. The window length $T_{w}$ is set to 512, which corresponds to approximately 20 seconds of video at 25 fps. This choice is based on the fact that $99.3\%$ of the action instances in the training set are shorter than 20 seconds. We train the SSAD network for 30 epochs with a learning rate of 0.0001.

The comparison between our SSAD and other state-of-the-art systems is shown in Table 1, with overlap IoU thresholds varied from 0.1 to 0.5. These results show that SSAD significantly outperforms the compared state-of-the-art methods. When the IoU threshold used in evaluation is set to 0.5, our SSAD network improves the state-of-the-art mAP from $19.0\%$ to $24.6\%$ . The Average Precision (AP) results of all categories with overlap threshold 0.5 are shown in Figure 5; the SSAD network outperforms the other state-of-the-art methods for 7 out of 20 action categories. Qualitative results are shown in Figure 6.

Results on MEXaction2. To train the action classifiers, we use all 1336 trimmed video clips in the training set. We also randomly sample 1300 background video clips from the untrimmed training videos. The prediction categories of the action classifiers are "HorseRiding", "BullChargeCape" and "Background". So the dimension of the SAS features is 9 in MEXaction2.

For the SSAD model, we use all 38 untrimmed videos in the MEXaction2 training set as the training set. Since the distribution of action instance lengths in MEXaction2 is similar to that of THUMOS’14, we also set the interval of snippets to zero and the window length $T_{w}$ to 512. We train all layers of SSAD for 10 epochs with a learning rate of 0.0001.

We compare SSAD with SCNN (Shou et al., 2016) and a typical dense trajectory features (DTF) based method (mex, 2015). The results of both methods are provided by (Shou et al., 2016). Comparison results are shown in Table 2. Our SSAD network achieves a significant performance gain in all action categories of MEXaction2, and the mAP is increased from $7.4\%$ to $11.0\%$ with overlap threshold 0.5. Figure 6 visualizes prediction results for the two action categories.

4. Model Analysis

We evaluate different variants of the SSAD network on THUMOS’14 to study their effects, including the action classifiers, the architecture of the SSAD network and the post-processing strategy.

Action classifiers. Action classifiers are used to extract the SAS features. To study the contribution of different action classifiers, we evaluate them individually and in combination with IoU threshold 0.5. As shown in Table 3, the two-stream networks perform better than the C3D network, and the combination of the two-stream and C3D networks leads to the best performance. In action recognition on UCF101, the two-stream network (Wang et al., 2015) achieves $91.4\%$, which is better than the $85.2\%$ of the C3D network (Tran et al., 2015) (without combining with other methods such as iDT (Wang and Schmid, 2013)). The two-stream network can therefore predict action categories more precisely than C3D at the snippet level, which leads to better performance of the SSAD network. Furthermore, the SAS features extracted by the two-stream network and the C3D network are complementary and achieve a better result when used together.

Architectures of SSAD network. In section 3.3, we discuss several architectures for the base network of SSAD. These architectures have the same input and output size, so we can evaluate them fairly without other changes to SSAD. The comparison results are shown in Table 4. Architecture $B$ achieves the best performance among these configurations and is adopted for the SSAD network. We can draw two conclusions from these results: (1) it is better to use a max pooling layer instead of a temporal convolutional layer to shorten the length of the feature map; (2) convolutional layers with kernel size 9 perform better than those with other kernel sizes.
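To make the first conclusion concrete, the sketch below shows how a temporal max pooling layer shortens a feature map. The pool size and stride of 2 are assumptions chosen for this example, and the function is a plain NumPy stand-in rather than the layer used in the SSAD implementation.

```python
import numpy as np

def max_pool_1d(features, pool=2, stride=2):
    """Temporal max pooling over a feature map shaped (length, channels)."""
    length = (features.shape[0] - pool) // stride + 1
    windows = [features[i * stride:i * stride + pool].max(axis=0)
               for i in range(length)]
    return np.stack(windows)

x = np.arange(16, dtype=float).reshape(8, 2)  # toy map: 8 time steps, 2 channels
pooled = max_pool_1d(x)
print(pooled.shape)  # (4, 2)
```

Under these settings a window of 512 snippets is reduced to 256 time steps, with no additional parameters to learn.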

Post-processing strategy. We evaluate multiple post-processing strategies. These strategies differ in the late fusion used to generate $\bm{p_{final}}$ and are listed in Table 5 together with their evaluation results. For example, $\bm{p_{class}}$ is used to generate $\bm{p_{final}}$ if it is ticked in the table. For the category scores, we find that $\bm{p_{class}}$ performs better than $\bm{\bar{p}}_{sas}$, and using the multiplication factor $p_{over}$ further improves the performance. The SSAD network achieves the best performance with the complete post-processing strategy.
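The ablated strategies can be pictured as switching individual terms of the late fusion on or off. The following sketch assumes that the selected category scores are summed and then scaled by the overlap score when it is used; this fusion form, the function name and the example values are assumptions for illustration and should be checked against the definition of $\bm{p_{final}}$ given earlier in the paper.

```python
import numpy as np

def fuse_scores(p_class=None, p_sas=None, p_over=None):
    """Fuse the selected category scores, optionally scaled by the overlap score."""
    parts = [np.asarray(p, dtype=np.float64) for p in (p_class, p_sas) if p is not None]
    if not parts:
        raise ValueError("at least one category score is required")
    score = np.sum(parts, axis=0)   # assumed fusion: sum of the selected scores
    if p_over is not None:
        score = score * p_over      # multiplication factor
    return score

# Complete strategy on a toy two-class example.
p_final = fuse_scores(p_class=[0.6, 0.4], p_sas=[0.8, 0.2], p_over=0.5)
```

Leaving an argument as None corresponds to an unticked column in the ablation table.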

Conclusion

In this paper, we propose the Single Shot Action Detector (SSAD) network for the temporal action detection task. Our SSAD network drops the proposal generation step and directly predicts action instances in untrimmed video. We have also explored many configurations of the SSAD network to make it work better for temporal action detection. With the Intersection-over-Union threshold set to 0.5 during evaluation, SSAD significantly outperforms other state-of-the-art systems, increasing mAP from $19.0\%$ to $24.6\%$ on THUMOS’14 and from $7.4\%$ to $11.0\%$ on MEXaction2. In our approach, feature extraction and action detection are conducted separately, which allows the SSAD network to work on concise high-level features and to be trained easily. A promising future direction is to combine the feature extraction procedure and the SSAD network into an end-to-end framework, so that the whole framework can be trained directly from raw video.