TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals

Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, Ram Nevatia

Introduction

We address the problem of generating Temporal Action Proposals (TAP) in long untrimmed videos, akin to generation of object proposals in images for rapid object detection. As in the case for objects, the goal is to make action proposals have high precision and recall, while maintaining computational efficiency.

There has been considerable work on action classification, where a “trimmed” video is classified into one of a set of specified categories. There has also been work on localizing actions in a longer, “untrimmed” video, i.e. temporal action localization. A straightforward way to use action classification techniques for localization is to apply temporal sliding windows; however, there is a trade-off between the density of the sliding windows and computation time. Taking cues from the success of proposal frameworks in object detection, recent work generates temporal action proposals in videos to improve the precision and speed of temporal localization.

State-of-the-art methods formulate TAP generation as a binary classification problem (i.e. action vs. background) and also apply a sliding window approach. Denser sliding windows usually lead to higher recall at the cost of computation time. Instead of relying on sliding windows, Deep Action Proposals (DAPs) uses a Long Short-term Memory (LSTM) network to encode video streams and infer multiple action proposals inside the streams. However, its average recall (AR), computed as the average of recall at temporal intersection over union (tIoU) between 0.5 and 1, suffers at small numbers of predicted proposals compared with the sliding-window-based method. (Newly released evaluation results from the DAPs authors show that SCNN-prop outperforms DAPs.)

To achieve high temporal localization accuracy at low computation cost, we propose to use temporal boundary regression. Boundary regression has been a successful practice for object localization. However, temporal boundary regression for actions has not been attempted in past work.

We present a novel method for fast TAP generation: Temporal Unit Regression Network (TURN). A long untrimmed video is first decomposed into short (e.g. 16 or 32 frames) video units, which serve as basic processing blocks. For each unit, we extract unit-level visual features using off-the-shelf models (C3D and a two-stream CNN model are evaluated). Features from a set of contiguous units, called a clip, are pooled to create clip features. Multiple temporal scales are used to create a clip pyramid. To provide temporal context, clip-level features from the internal and surrounding units are concatenated. Each clip is then treated as a proposal candidate and TURN outputs a confidence score indicating whether it is an action instance or not. In order to better estimate the action boundary, TURN outputs two regression offsets for the starting time and ending time of an action in the clip. Non-maximum suppression (NMS) is then applied to remove redundant proposals. The source code is available at https://github.com/jiyanggao/TURN-TAP.

DAPs and Sparse-prop use Average Recall vs. Average Number of retrieved proposals (AR-AN) to evaluate TAP performance. There are two issues with the AR-AN metric: (1) the correlation between AR-AN of TAP and mean Average Precision (mAP) of action localization was not explored; (2) the average number of retrieved proposals is related to the average video length of the test dataset, which makes AR-AN less reliable when evaluating across different datasets. Spatio-temporal action detection used Recall vs. Proposal Number (R-N); however, this metric does not take video lengths into consideration.

There are two criteria for a good metric: (1) it should be capable of evaluating the performance of different methods on the same dataset effectively; (2) it should be capable of evaluating the performance of the same method across different datasets (generalization capability). We expect that better TAP leads to better localization performance when the same localizer is used. We propose a new metric, Average Recall vs. Frequency of retrieved proposals (AR-F), for TAP evaluation. In Section 4.2, we validate that the proposed metric satisfies the two criteria by a quantitative correlation analysis between TAP performance and action localization performance.

We test TURN on THUMOS-14 and ActivityNet for TAP generation. Experimental results show that TURN outperforms the previous state-of-the-art methods by a large margin under AR-F and AR-AN. For run-time performance, TURN runs at over 880 frames per second (FPS) with C3D features and 260 FPS with flow CNN features on a single TITAN X GPU. We further plug TURN in as the proposal generation step of existing temporal action localization pipelines, and observe an improvement of mAP from the state-of-the-art 19% to 25.6% (at tIoU=0.5) on THUMOS-14 by changing only the proposals. State-of-the-art localization performance is also achieved on ActivityNet. We show state-of-the-art generalization capability by training TURN on THUMOS-14 and transferring it to ActivityNet without fine-tuning; strong generalization capability is also shown by testing TURN across different subsets of ActivityNet without fine-tuning.

In summary, our contributions are four-fold:

(1) We propose a novel architecture for temporal action proposal generation using temporal coordinate regression.

(2) Our proposed method achieves high efficiency (>800 fps) and outperforms previous state-of-the-art methods by a large margin.

(3) We show state-of-the-art generalization performance of TURN across different action datasets without dataset specific fine-tuning.

(4) We propose a new metric, AR-F, to evaluate the performance of TAP and compare AR-F with AR-AN and AR-N by quantitative analysis.

Related Work

Temporal Action Proposal. Sparse-prop proposes the use of STIPs and dictionary learning for class-independent proposal generation. S-CNN presents a two-stage action localization system, in which the first stage is temporal proposal generation, and shows the effectiveness of temporal proposals for action localization. S-CNN’s proposal network is based on fine-tuning 3D convolutional networks (C3D) for a binary classification task. DAPs adopts LSTM networks to encode a video stream and produce proposals inside the video stream.

Temporal Action Localization. Building on the progress of action classification, temporal action localization has received much attention recently. Ma et al. address the problem of early action detection. They propose to train an LSTM network with a ranking loss and merge the detection spans based on the frame-wise prediction scores generated by the LSTM. Singh et al. extend the two-stream framework to multi-stream bi-directional LSTM networks and achieve state-of-the-art performance on the MPII-Cooking dataset. Sun et al. transfer knowledge from web images to address temporal localization in untrimmed web videos. S-CNN presents a two-stage action localization framework: it first uses proposal networks to generate temporal proposals and then scores the proposals with localization networks, which are trained with classification and localization losses.

Spatio-temporal Action Localization. A handful of efforts have been seen in spatio-temporal action localization. Gkioxari et al. extract proposals from RGB images with SelectiveSearch and then apply R-CNN on both RGB and optical flow images for action detection. Weinzaepfel et al. replace SelectiveSearch with EdgeBoxes. Mettes et al. propose to use sparse points as supervision for action detection to save tedious annotation work.

Object Proposals and Detection. Object proposal generation methods can be classified into two types based on the features they use. One relies on hand-crafted low-level visual features, such as SelectiveSearch and EdgeBoxes. R-CNN and Fast R-CNN are built on this type of proposals. The other type is based on deep ConvNet feature maps, such as RPNs, which introduce the use of anchor boxes and spatial regression for object proposal generation. YOLO and SSD divide images into grids and regress object bounding boxes based on the grid cells. Bounding box coordinate regression is a common design shared by the second type of object proposal frameworks. Inspired by object proposals, we adopt temporal regression in the action proposal generation task.

Methods

In this section, we will describe the Temporal Unit Regression Network (TURN) and the training procedure.

As discussed before, the large-scale nature of video proposal generation requires the solution to be computationally efficient. Thus, repeatedly extracting visual features for the same window or for overlapping windows should be avoided. To accomplish this, we use video units as the basic processing units in our framework. A video $V$ contains $T$ frames, $V=\{t_{i}\}_{1}^{T}$, and is divided into $T/n_{u}$ consecutive video units, where $n_{u}$ is the number of frames in a unit. A unit is represented as $u=\{t_{i}\}_{s_{f}}^{s_{f}+n_{u}}$, where $s_{f}$ is the starting frame and $s_{f}+n_{u}$ is the ending frame. Units do not overlap with each other.
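The unit decomposition can be illustrated with a small Python sketch. The function name `split_into_units` is hypothetical, frames are 0-indexed with an exclusive end, and trailing frames that do not fill a whole unit are dropped; the text does not say how such a remainder is handled, so that choice is an assumption.

```python
def split_into_units(num_frames, unit_size):
    """Return (start_frame, end_frame) pairs for non-overlapping units.

    end_frame is exclusive. Frames left over after the last full unit are
    dropped (an assumption of this sketch).
    """
    num_units = num_frames // unit_size
    return [(k * unit_size, (k + 1) * unit_size) for k in range(num_units)]


units = split_into_units(num_frames=100, unit_size=16)
print(len(units))   # 6
print(units[:2])    # [(0, 16), (16, 32)]
```

Because units never overlap, each frame is passed through the feature extractor at most once, whatever windows are later built on top of the units.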

Each unit is processed by a visual encoder $E_{v}$ to get a unit-level representation $f_{u}=E_{v}(u)$. In our experiments, C3D, an optical flow based CNN model and an RGB image CNN model are investigated. Details are given in Section 4.2.

3.2 Clip Pyramid Modeling

A clip (i.e. window) $c$ is composed of units, $c=\{u_{j}\}_{s_{u}}^{s_{u}+n_{c}}$, where $s_{u}$ is the index of the starting unit and $n_{c}$ is the number of units inside $c$. $e_{u}=s_{u}+n_{c}$ is the index of the ending unit, and $\{u_{j}\}_{s_{u}}^{e_{u}}$ are called the internal units of $c$. Besides the internal units, context units for $c$ are also modeled. $\{u_{j}\}_{s_{u}-n_{ctx}}^{s_{u}}$ and $\{u_{j}\}_{e_{u}}^{e_{u}+n_{ctx}}$ are the context before and after $c$ respectively, where $n_{ctx}$ is the number of units we consider for context. Internal features and context features are pooled from unit features separately by a function $P$. The final feature $f_{c}$ for a clip is the concatenation of the context features and the internal features; $f_{c}$ is given by

$$f_{c}=P(\{u_{j}\}_{s_{u}-n_{ctx}}^{s_{u}})\parallel P(\{u_{j}\}_{s_{u}}^{e_{u}})\parallel P(\{u_{j}\}_{e_{u}}^{e_{u}+n_{ctx}})$$

where $\parallel$ represents vector concatenation and mean pooling is used for $P$. We scan an untrimmed video by building window pyramids at each unit position, i.e. an anchor unit. A clip pyramid $p$ consists of temporal windows with different temporal resolutions, $p=\{c^{n_{c}}\}$, $n_{c}\in\{n_{c,1},n_{c,2},...\}$. Note that, although multi-resolution clips have temporal overlaps, the clip-level features are computed from unit-level features, which are only calculated once.
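As a rough illustration of the pooling and concatenation described above, the following Python sketch builds clip features from precomputed unit features. It assumes half-open unit ranges, zero vectors for context that falls outside the video, and clips that start at the anchor unit; none of these details are specified in the text, and all function names are hypothetical.

```python
def mean_pool(vectors, dim):
    """Mean-pool a list of equal-length feature vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def clip_feature(unit_feats, s_u, n_c, n_ctx):
    """Concatenate pooled [context-before | internal | context-after] features.

    unit_feats holds one feature vector per unit, computed once per video.
    """
    dim = len(unit_feats[0])
    e_u = s_u + n_c
    before = unit_feats[max(0, s_u - n_ctx):s_u]
    internal = unit_feats[s_u:e_u]
    after = unit_feats[e_u:e_u + n_ctx]
    return mean_pool(before, dim) + mean_pool(internal, dim) + mean_pool(after, dim)


def clip_pyramid(unit_feats, anchor, scales, n_ctx):
    """One clip feature per scale, keeping only clips that fit inside the video."""
    return {n_c: clip_feature(unit_feats, anchor, n_c, n_ctx)
            for n_c in scales if anchor + n_c <= len(unit_feats)}


# Toy 2-dimensional unit features for a video of 8 units.
unit_feats = [[float(i), 1.0] for i in range(8)]
pyramid = clip_pyramid(unit_feats, anchor=2, scales=[1, 2, 4, 8], n_ctx=2)
print(sorted(pyramid))  # [1, 2, 4]
print(pyramid[2])       # [0.5, 1.0, 2.5, 1.0, 4.5, 1.0]
```

Every clip in the pyramid reuses the same `unit_feats` list, so adding more scales costs only extra pooling, not extra feature extraction.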

3.3 Unit-level Temporal Coordinate Regression

The intuition behind temporal coordinate regression is that humans can infer the approximate start and end time of an action instance (e.g. shooting a basketball, swinging a golf club) without watching the entire instance; similarly, neural networks might be able to infer the temporal boundaries. Specifically, we design a unit regression model that takes a clip-level representation $f_{c}$ as input and has two sibling output layers. The first one outputs a confidence score indicating whether the input clip is an action instance. The second one outputs temporal coordinate regression offsets. The regression offsets are

$$o_{s}=s_{clip}-s_{gt},\qquad o_{e}=e_{clip}-e_{gt}$$

where $s_{clip}$, $e_{clip}$ are the indices of the starting unit and ending unit of the input clip, and $s_{gt}$, $e_{gt}$ are the indices of the starting unit and ending unit of the matched ground truth.
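A minimal sketch of how unit-level offsets could be computed for training and applied at test time is given below. It assumes the offsets are defined as clip index minus ground-truth index, so a predicted offset is subtracted from the clip boundary; this sign convention and the function names are assumptions of the sketch.

```python
def regression_targets(s_clip, e_clip, s_gt, e_gt):
    """Unit-level offsets of a clip relative to its matched ground truth."""
    return s_clip - s_gt, e_clip - e_gt


def refine_proposal(s_clip, e_clip, o_s, o_e, unit_size):
    """Apply predicted unit-level offsets, then convert unit indices to frames."""
    s_ref = s_clip - o_s
    e_ref = e_clip - o_e
    return s_ref * unit_size, e_ref * unit_size


o_s, o_e = regression_targets(s_clip=10, e_clip=18, s_gt=12, e_gt=17)
print(o_s, o_e)                                # -2 1
print(refine_proposal(10, 18, o_s, o_e, 16))   # (192, 272)
```

With perfect offset predictions the refined proposal coincides with the ground truth expressed in frames (units 12 to 17 with 16 frames per unit).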

There are two salient aspects in our coordinate regression model. First, instead of regressing the temporal coordinates at frame-level, we adopt unit-level coordinate regression. As the basic unit-level features are extracted to encode $n_{u}$ frames, the features may not be discriminative enough to regress the coordinates at frame-level. Compared with frame-level regression, unit-level coordinate regression is easier to learn and more effective. Second, in contrast to spatial bounding box regression, we don’t use coordinate parametrization. We directly regress the offsets of the starting unit coordinates and the ending unit coordinates. The reason is that objects can be re-scaled in images due to camera projection, so the bounding box coordinates should first be normalized to some standard scale. However, the time spans of actions cannot be easily rescaled in videos.

3.4 Loss Function

For training TURN, we assign a binary class label (of being an action or not) to each clip (generated at each anchor unit). A positive label is assigned to a clip if: (1) the window clip has the highest temporal Intersection over Union (tIoU) overlap with a ground truth clip; or (2) the window clip has tIoU larger than 0.5 with any of the ground truth clips. Note that a single ground truth clip may assign positive labels to multiple window clips. Negative labels are assigned to non-positive clips whose tIoU is equal to 0.0 (i.e. no overlap) for all ground truth clips. We design a multi-task loss $L$ to jointly train classification and coordinate regression.

$$L=L_{cls}+\lambda L_{reg}$$

where $L_{cls}$ is the loss for action/background classification, which is a standard Softmax loss. $L_{reg}$ is for temporal coordinate regression and $\lambda$ is a hyper-parameter. The regression loss is

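$$L_{reg}=\frac{1}{N_{pos}}\sum_{i=1}^{N}l^{*}_{i}\left(|o_{s,i}-o^{*}_{s,i}|+|o_{e,i}-o^{*}_{e,i}|\right)$$

where $o_{s,i}$, $o_{e,i}$ are the predicted start and end offsets of the $i$-th clip, $o^{*}_{s,i}$, $o^{*}_{e,i}$ are the corresponding ground truth offsets and $N$ is the number of samples in a mini-batch. The L1 distance is adopted. $l^{*}_{i}$ is the label, $1$ for positive samples and $0$ for background samples. $N_{pos}$ is the number of positive samples. The regression loss is calculated only for positive samples.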
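The multi-task loss can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the released implementation: the classification term is averaged over the mini-batch, the regression term sums the absolute start and end errors of positive samples and divides by their count, and the names `turn_loss` and `softmax_cross_entropy` are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over two logits (0 = background, 1 = action)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def turn_loss(logits, labels, pred_offsets, gt_offsets, lam=2.0):
    """Return (L, L_cls, L_reg) with L = L_cls + lam * L_reg."""
    n = len(labels)
    l_cls = sum(softmax_cross_entropy(z, y) for z, y in zip(logits, labels)) / n
    n_pos = max(1, sum(labels))  # avoid division by zero in all-background batches
    l_reg = sum(
        y * (abs(p[0] - g[0]) + abs(p[1] - g[1]))  # y = 0 masks background samples
        for y, p, g in zip(labels, pred_offsets, gt_offsets)
    ) / n_pos
    return l_cls + lam * l_reg, l_cls, l_reg


logits = [[0.0, 0.0], [0.0, 0.0]]
labels = [1, 0]
pred_offsets = [(1.0, -0.5), (3.0, 3.0)]
gt_offsets = [(0.0, 0.0), (0.0, 0.0)]
total, l_cls, l_reg = turn_loss(logits, labels, pred_offsets, gt_offsets)
print(l_reg)  # 1.5
```

The second sample is background, so its large offset error does not contribute to the regression term.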

During training, the background to positive samples ratio is set to be 10 in a mini-batch. The learning rate and batch size are set as 0.005 and 128 respectively. We use the Adam optimizer to train TURN.
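The labeling rule used to build training samples can be sketched as below. Clips and ground truths are (start, end) pairs; clips that overlap a ground truth without satisfying either positive rule are marked `None`, meaning they are not used as negatives. The function names and the `None` convention are assumptions of this sketch.

```python
def tiou(a, b):
    """Temporal intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(clips, gts, pos_thresh=0.5):
    """Return 1 (positive), 0 (negative) or None (unused) for each clip."""
    ious = [[tiou(c, g) for g in gts] for c in clips]
    labels = []
    for row in ious:
        best = max(row) if row else 0.0
        if best > pos_thresh:        # rule (2): tIoU above 0.5 with some ground truth
            labels.append(1)
        elif best == 0.0:            # no overlap with any ground truth
            labels.append(0)
        else:
            labels.append(None)
    for j in range(len(gts)):        # rule (1): best-overlapping clip per ground truth
        col = [row[j] for row in ious]
        if col and max(col) > 0.0:
            labels[col.index(max(col))] = 1
    return labels


clips = [(0, 10), (8, 20), (30, 40), (70, 80)]
gts = [(9, 19), (35, 60)]
print(assign_labels(clips, gts))  # [None, 1, 1, 0]
```

The third clip has tIoU below 0.5 with its ground truth but is still positive because no other clip overlaps that ground truth more.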

Evaluation

In this section, we introduce the evaluation metrics, experimental setup and discuss the experimental results.

We consider three different metrics to assess the quality of TAP; the major difference lies in how the number of retrieved proposals is determined: Average Recall vs. Number of retrieved proposals (AR-N), Average Recall vs. Average Number of retrieved proposals (AR-AN), and Average Recall vs. Frequency of retrieved proposals (AR-F). Average Recall (AR) is calculated as the mean value of the recall rate at tIoU between 0.5 and 1.
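Average Recall for one video can be sketched as follows. The threshold grid (0.50 to 0.95 in steps of 0.05) is an assumption, since the text only states that tIoU ranges between 0.5 and 1; the function names are hypothetical.

```python
def tiou(a, b):
    """Temporal intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(proposals, gts, thresh):
    """Fraction of ground truths matched by at least one proposal at tIoU >= thresh."""
    if not gts:
        return 0.0
    hits = sum(1 for g in gts if any(tiou(p, g) >= thresh for p in proposals))
    return hits / len(gts)


def average_recall(proposals, gts, thresholds=None):
    """Mean recall over a grid of tIoU thresholds."""
    if thresholds is None:
        thresholds = [t / 100 for t in range(50, 100, 5)]  # assumed grid
    return sum(recall_at(proposals, gts, t) for t in thresholds) / len(thresholds)


gts = [(0, 10), (20, 30)]
proposals = [(0, 10), (20, 36)]
ar = average_recall(proposals, gts)
print(round(ar, 2))  # 0.65
```

The second proposal has tIoU 0.625 with its ground truth, so it counts as a hit only at thresholds 0.50, 0.55 and 0.60, which pulls AR below 1.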

AR-N curve. In this metric, the numbers of retrieved proposals (N) for all test videos are the same. This curve plots AR versus number of retrieved proposals.

AR-AN curve. In this metric, AR is calculated as a function of the average number of retrieved proposals (AN). AN is calculated as $\overline{\Theta}=\rho\overline{\Phi}$, $\rho\in(0,1]$, in which $\overline{\Phi}=\frac{1}{n}\sum_{i=1}^{n}\Phi_{i}$ is the average number of all proposals of the test videos, $\rho$ is the ratio of picked proposals to evaluate, $n$ is the number of test videos and $\Phi_{i}$ is the number of all proposals for video $i$. By scanning the ratio $\rho$ from 0 to 1, the number of retrieved proposals in each video varies from 0 to the number of all proposals, and thus the average number of retrieved proposals also varies.

AR-F curve. This is the new metric that we propose. We measure average recall as a function of proposal frequency (F), which denotes the number of retrieved proposals per second for a video. For a video of length $l_{i}$ and a proposal frequency of $F$, the number of retrieved proposals for this video is $R_{i}=Fl_{i}$.
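To make the difference between the three curves concrete, the sketch below computes how many top-scoring proposals would be kept per video under each metric. The helper names, the rounding and the capping at the number of available proposals are assumptions; the video lengths and proposal counts are toy values chosen for illustration.

```python
def counts_ar_n(total_proposals, n):
    """AR-N: the same number of proposals for every video."""
    return [min(n, phi) for phi in total_proposals]


def counts_ar_an(total_proposals, rho):
    """AR-AN: a fixed ratio rho of each video's proposals."""
    return [int(round(rho * phi)) for phi in total_proposals]


def counts_ar_f(video_lengths_sec, total_proposals, freq):
    """AR-F: freq proposals per second of video, i.e. R_i = F * l_i."""
    return [min(int(round(freq * l)), phi)
            for l, phi in zip(video_lengths_sec, total_proposals)]


lengths = [30.0, 300.0]   # toy video lengths in seconds
totals = [200, 2000]      # toy numbers of generated proposals per video

print(counts_ar_n(totals, 100))           # [100, 100]
print(counts_ar_an(totals, 0.1))          # [20, 200]
print(counts_ar_f(lengths, totals, 1.0))  # [30, 300]
```

Under AR-N the short and the long video receive the same budget, while under AR-F the budget scales with duration and does not depend on how many proposals a method happens to generate.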

We also report Recall@X-tIoU curve: recall rate at X with regard to different tIoU. X could be number of retrieved proposals (N), average number of retrieved proposals (AN) and proposal frequency (F).

For the evaluation of temporal action localization, we follow the traditional mean Average Precision (mAP) metric used in THUMOS-14 and ActivityNet. A prediction is regarded as positive only when it has correct category prediction and tIoU with ground truth higher than a threshold. We use the official evaluation toolkit provided by THUMOS-14 and ActivityNet.

4.2 Experiments on THUMOS-14

Datasets. The temporal action localization part of THUMOS-14 contains over 20 hours of videos from 20 sports classes. This part consists of 200 videos in validation set and 213 videos in test set. TURN model is trained on the validation set, as the training set of THUMOS-14 contains only trimmed videos.

Experimental setup. We perform the following experiments: (1) different temporal proposal evaluation metrics are compared; (2) the performance of TURN and other TAP generation methods is compared under the evaluation metrics mentioned above (i.e. AR-F and AR-AN); (3) different TAP generation methods are compared on the temporal action localization task with the same localizer/classifier. Specifically, we feed the proposals into a localizer/classifier, which outputs the confidence scores of 21 classes (20 classes of action plus background). Two localizers/classifiers are adopted: (a) SVM classifiers: one-vs-all linear SVM classifiers are trained for all 21 classes using C3D fc6 features; (b) S-CNN localizer: the pre-trained localization network of S-CNN is adopted.

For the TURN model, the context unit number $n_{ctx}$ is 4, $\lambda$ is 2.0, the dimension of the middle layer $f_{m}$ is 1000, and temporal window pyramids are built with $\{1,2,4,8,16,32\}$ units. We test TURN with different unit sizes $n_{u}\in\{16,32\}$ and different unit features, including C3D, optical flow based CNN features and RGB CNN features. The NMS threshold is set to be 0.1 smaller than the tIoU used in evaluation. We implement the TURN model in TensorFlow.
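Temporal NMS with a threshold tied to the evaluation tIoU can be sketched as below. A greedy, score-sorted variant is assumed, as the text does not describe the exact NMS procedure; proposals are (start, end, score) tuples and the function names are hypothetical.

```python
def tiou(a, b):
    """Temporal IoU of two intervals given by their first two fields."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, threshold):
    """Greedy NMS: keep a proposal only if it overlaps no kept one above threshold."""
    kept = []
    for cand in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(tiou(cand, k) <= threshold for k in kept):
            kept.append(cand)
    return kept


eval_tiou = 0.5
proposals = [(0, 10, 0.9), (1, 11, 0.8), (20, 30, 0.7)]
kept = temporal_nms(proposals, threshold=eval_tiou - 0.1)
print(kept)  # [(0, 10, 0.9), (20, 30, 0.7)]
```

The second proposal overlaps the top-scoring one with tIoU of about 0.82, well above the 0.4 threshold, so it is suppressed.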

Comparison of different evaluation metrics. To validate the effectiveness of different evaluation metrics, we compare AR-F, AR-N and AR-AN by a correlation analysis with localization performance (mAP). We generate seven different sets of proposals, including random proposals, sliding windows and variants of S-CNN proposals (details are given in the supplementary material). We then test the localization performance using the proposals, as shown in Figure 3 (a)-(c). SVM classifiers are used for localization.

A detailed analysis of correlation and video length is given in Figure 3 (d). The test videos are sorted by video length and then divided evenly into four groups. The x-axis is the average video length of a group, and the y-axis is the correlation coefficient between action localization performance and TAP performance for that group. Each point in Figure 3 (d) represents the correlation between TAP and localization performance of one group under one evaluation metric. As can be observed in Figure 3, the correlation coefficient between mAP and AR-F is consistently higher than 0.9 at all video lengths. In contrast, the correlation between AR-N and mAP is affected by the video length distribution. Note that AR-AN also shows a stable correlation with mAP; this is partially because the TAP generation methods we use generate numbers of proposals proportional to video length.

To assess generalization, assume that we have two different datasets, $S_{0}$ and $S_{1}$, whose average numbers of all proposals are $\overline{\Phi}_{0}$ and $\overline{\Phi}_{1}$ respectively. As introduced before, the average number of retrieved proposals $\overline{\Theta}=\rho\overline{\Phi}$, $\rho\in(0,1]$, is dependent on $\Phi$. When we compare AR at a certain $AN=\overline{\Theta}_{x}$ between $S_{0}$ and $S_{1}$, as $\overline{\Phi}_{0}$ and $\overline{\Phi}_{1}$ are different, we need to set different $\rho_{0}$ and $\rho_{1}$. This means that the ratios between retrieved proposals and all generated proposals are different for $S_{0}$ and $S_{1}$, so the AR values calculated for $S_{0}$ and $S_{1}$ at the same $AN=\overline{\Theta}_{x}$ cannot be compared directly. For AR-F, the number of proposals retrieved is based on “frequency”, which is independent of the average number of all generated proposals.

In summary, AR-N cannot evaluate TAP performance effectively on the same dataset, as number of retrieved proposals should vary with video lengths. AR-AN cannot be used to compare TAP performance among different datasets, as the retrieval ratio depends on dataset’s video length distribution, which makes the comparison unreasonable. AR-F satisfies both requirements.

Comparison of visual features. We test TURN with three unit-level features to assess the effect of visual features on AR performance: C3D features, RGB CNN features with temporal mean pooling and dense flow CNN features. The C3D model is pre-trained on Sports1M; all 16 frames in a unit are input into C3D and the output of the fc6 layer is used as the unit-level feature. For RGB CNN features, we uniformly sample 8 frames from a unit, extract “Flatten_673” features using a ResNet model (pre-trained on the training set of the ActivityNet v1.3 dataset) and compute the mean of these 8 features as the unit-level feature. For dense flow CNN features, we sample 6 consecutive frames at the center of a unit and calculate optical flow between them. The flows are then fed into a BN-Inception model that is pre-trained on the training set of the ActivityNet v1.3 dataset. The output of the “global pool” layer of BN-Inception is used as the unit-level feature.

As shown in Figure 4, dense flow CNN feature (TURN-FL) gives the best results, indicating optical flow can capture temporal action information effectively. In contrast, RGB CNN features (TURN-RGB) show inferior performance and C3D (TURN-C3D) gives competitive performance.

Temporal context and unit-level coordinate regression. We compare four variants of TURN to show the effectiveness of temporal context and unit regression: (1) binary cls w/o ctx: binary classification (no regression) without the use of temporal context, (2) binary cls w/ ctx: binary classification (no regression) with the use of context, (3) frame reg w/ ctx: frame-level coordinate regression with the use of context and (4) unit reg w/ ctx: unit-level coordinate regression with the use of context (i.e. our full model). The four variants are compared with AR-F curves. As shown in Figure 4, temporal context helps to classify action and background by providing additional information. As shown in AR-F curve, unit reg w/ ctx has higher AR than the other variants at all frequencies, indicating that unit-level regression can effectively refine the proposal location. Some TURN proposal results are shown in Figure 6.

Comparison with state-of-the-art. We compare TURN with the state-of-the-art methods under the AR-AN, AR-F, Recall@AN-tIoU and Recall@F-tIoU metrics. The TAP generation methods include DAPs, SCNN-prop, Sparse-prop, sliding window, and random proposals. For DAPs, Sparse-prop and SCNN-prop, we plot the curves using the proposal results provided by the authors. “Sliding window proposals” include all sliding windows of length from 16 to 512 overlapped by 75%; each window is assigned a random score. “Random proposals” are generated by assigning random starting and ending temporal coordinates (the ending temporal coordinate is larger than the starting temporal coordinate); each random window is assigned a random score. As shown in Figure 5, TURN outperforms the state-of-the-art consistently by a large margin under all four metrics.
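The two baseline proposal sets can be sketched as follows. The window lengths are assumed to be powers of two between 16 and 512 frames (the text only gives the range), and the seeded generators and function names are hypothetical choices made so that the example is reproducible.

```python
import random


def sliding_window_proposals(num_frames, lengths=(16, 32, 64, 128, 256, 512),
                             overlap=0.75, seed=0):
    """All windows of the given lengths with fixed overlap, each with a random score."""
    rng = random.Random(seed)
    proposals = []
    for length in lengths:
        stride = max(1, int(length * (1 - overlap)))
        for start in range(0, num_frames - length + 1, stride):
            proposals.append((start, start + length, rng.random()))
    return proposals


def random_proposals(num_frames, count, seed=0):
    """Random (start, end, score) windows with end strictly larger than start."""
    rng = random.Random(seed)
    out = []
    for _ in range(count):
        start, end = sorted(rng.sample(range(num_frames + 1), 2))
        out.append((start, end, rng.random()))
    return out


windows = sliding_window_proposals(num_frames=64)
print(len(windows))  # 19
rand = random_proposals(num_frames=64, count=5)
print(all(s < e for s, e, _ in rand))  # True
```

For a 64-frame video, the sketch yields 13 windows of length 16 (stride 4), 5 of length 32 (stride 8) and 1 of length 64; longer windows do not fit.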

How does unit size affect AR and run-time performance? The impact of unit size on AR and computation speed is evaluated with $n_{u}\in\{16,32\}$. We keep the other hyper-parameters the same as in Section 4.2. Table 1 compares the three TURN variants (TURN-FL-16, TURN-FL-32, TURN-C3D-16) and three state-of-the-art TAP methods in terms of recall (AR@F=1.0) and run-time (FPS) performance. We randomly select 100 videos from the THUMOS-14 validation set and run TURN-FL-16, TURN-FL-32 and TURN-C3D-16 on a single Nvidia TITAN X GPU. The run-times of DAPs and SCNN-prop are taken from their respective papers, and were measured on a TITAN X GPU and a GTX 980 GPU respectively. The hardware used for Sparse-prop is not specified in its paper.

As can be seen, there is a trade-off between AR and FPS: a smaller unit size leads to a higher recall rate, but also to higher computational complexity. We consider unit size as temporal coordinate precision; for example, unit sizes of 16 and 32 frames represent approximately half a second and one second respectively. The major part of the computation time comes from unit-level feature extraction. A smaller unit size leads to more units, which increases computation time; on the other hand, a smaller unit size also increases temporal coordinate precision, which improves the precision of temporal regression. C3D features are faster to compute than flow CNN features, but give lower performance. Compared with state-of-the-art methods, TURN-C3D-16 outperforms the current state-of-the-art AR performance while accelerating computation by more than 6 times. TURN-FL-16 achieves the highest AR performance with competitive run-time performance.

TURN for temporal action localization. We feed proposal results of different TAP generation methods into the same temporal action localizers/classifiers to compare the quality of proposals. The value of mAP@tIoU=0.5 is reported in Table 2. TURN outperforms all other methods with both the SVM classifier and the S-CNN localizer. Sparse-prop, SCNN-prop and DAPs all use C3D to extract features. It is worth noting that the localization results of the four different proposal methods agree well with their proposal performance under the AR-F metric in Figure 5: the methods that perform better under AR-F achieve higher mAP in temporal action localization.

A more detailed comparison of state-of-the-art localization methods is given in Table 3. It can be seen that, by applying TURN with linear SVM classifiers for action localization, we achieve performance comparable with the state-of-the-art methods. By further incorporating the S-CNN localizer, we outperform all other methods by a large margin at all tIoU thresholds. The experimental results demonstrate the high quality of TURN proposals.

TURN helps action localization in two ways: (1) TURN serves as the first stage of a localization pipeline (e.g. S-CNN, SVM) to generate high-quality TAP, and thus increases localization performance; (2) TURN accelerates localization pipelines by filtering out many background segments, thus reducing unnecessary computation.

4.3 Experiments on ActivityNet

Datasets. ActivityNet datasets provide rich and diverse action categories. There are three releases of ActivityNet dataset: v1.1, v1.2 and v1.3. All three versions define a 5-level hierarchy of action classes. Nodes on higher level represent more abstract action categories. For example, the node “Housework” on level-3 has child nodes “Interior cleaning”, “Sewing, repairing, & maintaining textiles” and “Laundry” on level-4. From the hierarchical action categories definition, a subset can be formed by including all action categories that belong to a certain node.

Experiment setup. To compare with previous work, we run experiments on v1.1 (on the subsets “Works” and “Sports”) for temporal action localization, and on v1.2 for proposal generalization capability, following the evaluation protocol of previous work. On v1.3, we design a different experimental setup to test TURN’s cross-domain generalization capability: four subsets having distinct semantic meanings are selected, including “Participating in Sports, Exercise, or Recreation”, “Vehicles”, “Housework” and “Arts and Entertainment”. We also check that the action categories in different subsets are not semantically related: for example, “archery” and “dodge ball” in the “Sports” subset, “changing car wheels” and “fixing bicycles” in the “Vehicles” subset, “vacuuming floor” and “cleaning shoes” in the “Housework” subset, and “ballet” and “playing saxophone” in the “Arts” subset.

The evaluation metrics include AR@AN curve for temporal action proposal and mAP for action localization. AR@F=1.0 is reported for comparing proposal performance on different subsets. The validation set is used for testing as the test set is not publicly available.

To train TURN, we set the number of frames in a unit $n_{u}$ to be 16, the context unit number $n_{ctx}$ to be 4, $L$ to be 6 and $\lambda$ to be 2.0. We build the temporal window pyramid with $\{2,4,8,16,32,64,128\}$ units. The NMS threshold is set to be 0.1 smaller than the tIoU used in evaluation. For the temporal action localizer, SVM classifiers are trained with two-stream CNN features on the “Sports” and “Works” subsets.

Generalization capability of TURN. One important property of TAP is the expectation to generalize beyond the categories it is trained on.

On ActivityNet v1.2, we follow the same evaluation protocol as DAPs: the model is trained on the THUMOS-14 validation set and tested on three different sets of ActivityNet v1.2: the whole set of ActivityNet v1.2 (all 100 categories), ActivityNet v1.2 $\cap$ THUMOS-14 (the 9 categories shared between the two) and ActivityNet v1.2 $\leqslant$ 1024 frames (videos with unseen categories with annotations up to 1024 frames). To avoid any possible dataset overlap and enable direct comparison, we use C3D (pre-trained on Sports1M) as the feature extractor, the same as DAPs did. As shown in Figure 7, TURN has better generalization capability on all three sets.

On ActivityNet v1.3, we implement a different setup for evaluating generalization capability on subsets that contain semantically distinct actions: (1) we train TURN on one subset and test on the other three subsets; (2) we train on the ensemble of all 4 subsets and test on each subset. TURN is trained with C3D unit features, to avoid any overlap of training data. We also report the performance of sliding windows (lengths of 32, 64, 128, 256, 512, 1024 and 2048, with 50% overlap) in each subset. Average recall at frequency 1.0 (AR@F=1.0) is reported in Table 4. The left-most column lists the subsets used for training. The numbers of action classes and training videos in each subset are shown in brackets. The top row lists the subsets used for testing. The off-diagonal elements indicate that the training data and test data are from different subsets; the diagonal elements indicate that the training data and test data are from the same subset.

As can be seen in Table 4, the overall generalization capability is strong. Specifically, the generalization capability when training on “Sports” subset is the best compared with other subsets, which may indicate that more training data would lead to better generalization performance. The “Ensemble” row shows that using training data from other subsets would not harm the performance of each subset.

TURN for temporal action localization. Temporal action localization performance is evaluated and compared on the “Works” and “Sports” subsets of ActivityNet v1.1. TURN trained with dense flow CNN features is used for comparison. On v1.1, TURN-FL-16 proposals are fed into one-vs-all SVM classifiers, which are trained with two-stream CNN features. From the results shown in Table 5, we can see that TURN proposals improve localization performance.

Conclusion

We presented a novel and effective Temporal Unit Regression Network (TURN) for fast TAP generation. We proposed a new metric for TAP: Average Recall-Proposal Frequency (AR-F). AR-F is robustly correlated with temporal action localization performance and allows performance comparison among different datasets. TURN runs at over 880 FPS with state-of-the-art AR performance. TURN is robust across different visual features, including C3D and dense flow CNN features. We showed the effectiveness of TURN as a proposal generation stage in localization pipelines on THUMOS-14 and ActivityNet.
