PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding

Chunhui Liu, Yueyu Hu, Yanghao Li, Sijie Song, Jiaying Liu

Introduction

The tremendous success of deep learning has allowed data-driven learning methods to achieve superior performance on many computer vision tasks. Accordingly, several well-known large-scale datasets have been collected to boost research in this area. ActivityNet, for example, is an RGB video dataset gathered from Internet media such as YouTube, with well-annotated labels and temporal boundaries.

Thanks to the prevalence of affordable color-depth sensing cameras such as Microsoft Kinect, and their ability to obtain depth data and the 3D skeleton of the human body on the fly, 3D activity analysis has drawn great attention. As an intrinsic high-level representation, the 3D skeleton is valuable and comprehensive for summarizing a series of human dynamics in a video, and thus benefits more general action analysis. Besides being succinct and effective, it is highly robust to illumination changes, cluttered backgrounds, and camera motion. However, although the 3D skeleton is a popular data modality, 3D action analysis suffers from a lack of large-scale benchmark datasets. To the best of our knowledge, existing 3D action benchmarks are limited in two respects.

$\bullet$ Shortage of large action detection datasets: Action detection plays an important role in video analytics and can be studied effectively by learning from massive numbers of samples. However, most existing 3D datasets target action recognition on segmented videos, and there is no large-scale multi-modal dataset for action detection. Additionally, previous detection benchmarks contain only a small number of actions per video, even in some large-scale RGB datasets. More actions within one untrimmed video should improve the robustness of action detection algorithms that rely on sequential action modeling and feature learning.

$\bullet$ Limitation in data modalities: Different modalities (e.g., optical flow, RGB, infrared, and skeleton) capture different aspects of an action and provide complementary information. For example, RGB frames deliver appearance information but lack motion representation, while optical flow describes motion but misses the depth information that skeleton data can provide. Combining multi-modal data would therefore benefit action recognition and temporal localization. Traditional datasets focus mainly on a single modality of action representation, so it is worth exploiting multi-modal data with well-designed algorithms for action analysis.

To overcome these limitations, we develop a new large-scale continuous multi-modality 3D human activity dataset (PKU-MMD) to facilitate further study of human activity understanding, especially action detection. As shown in Figure 1, our dataset contains 1076 videos covering 51 action categories, performed by 66 subjects and captured from 3 camera views, and each video contains approximately twenty action instances. In total, the dataset comprises about 3,000 minutes and 5,400,000 frames. We provide four raw modalities: RGB frames, depth maps, skeleton data, and infrared sequences. Further modalities, such as optical flow and motion vectors, can be computed from them.

In addition, we propose a new 2D protocol that evaluates the precision-recall curve of each method in a more straightforward manner. By jointly taking the overlap ratio and the detection confidence into account, each algorithm can be evaluated with a single value instead of a list of mean average precisions at corresponding overlap ratios. Several experiments are conducted to test both the capabilities of different approaches for action detection and the performance of combining different modalities.

Related Work

In this section, we briefly summarize the development of activity analysis. As a part of pattern recognition, activity analysis follows a path common in machine learning, where large-scale benchmarks are as significant as the methods themselves. Here, we briefly introduce a series of benchmarks and approaches. For a more extensive review of activity analysis, we refer the reader to the corresponding survey papers.

Early activity analysis mainly focuses on action recognition, which is a classification task on segmented videos. Traditional methods mainly rely on hand-crafted features for video representation. Densely tracking points in the optical flow field, combined with features such as Histogram of Oriented Gradients (HOG), Histogram of Flow (HOF), and Motion Boundary Histograms (MBH) encoded by Fisher Vectors, achieved good performance. Recently, deep learning has been exploited for action recognition. Deep approaches automatically learn robust feature representations directly from raw data and recognize actions at the same time with deep neural networks. To model temporal dynamics, Recurrent Neural Networks (RNNs) have also been exploited for action recognition. For example, CNN layers can be used to extract visual features while the subsequent recurrent layers handle temporal dynamics.

For action detection, existing methods mainly use either a sliding-window scheme or action proposal approaches. These methods usually suffer from low computational efficiency or unsatisfactory localization accuracy because of the overlapping design and the unsupervised localization approach. Most methods are designed for offline action detection. However, many recent works study recognizing actions on the fly, before the action is complete, using a learning formulation based on a structural SVM, a non-parametric moving pose framework, or a dynamic integral bag-of-words approach. LSTMs are also used for online action detection and forecasting, providing frame-wise class information and forecasting the start and end of actions.

As a fundamental requirement of research, the video source also determines the branches of action analysis. Early action analysis datasets mainly focus on home surveillance activities such as drinking or waving hands. The analysis of those simple indoor activities marks the start of action recognition research. The advantage of this kind of video is that it is usually easy and cheap to capture. However, collecting a large-scale benchmark with cameras can be troublesome. Fortunately, the rapid development of Internet technology and data mining algorithms enables a new approach: collecting datasets from third-party Internet media such as YouTube. As a result, RGB-based datasets have reached a grand scale, with hundreds of action labels and terabytes of video. Recently, several works have also focused on collecting datasets of different action types, such as TV series, movies, and Olympic Games.

With the launch of Microsoft Kinect, a diversity of action sources became possible. Different input sources, such as depth data and skeleton data, have been studied. Depth data provides 3D information that is beneficial for action understanding. The skeleton, as a high-level representation of the human body, provides valuable and condensed information for recognizing actions. Because Kinect devices provide a real-time algorithm that generates skeleton data from RGB, depth, and infrared information, the skeleton is an ideal source for real-time algorithms and for deployment on mobile devices such as robots or phones.

Despite the diversity of sources, action understanding still faces several problems, the foremost being accuracy. Another problem is the poor performance of cross-dataset recognition. That is, existing approaches and machine learning models perform well only when the training and test sets come from similar environmental conditions. Open-domain action recognition and detection remains challenging.

3D Activity Understanding Approaches

For skeleton-based action recognition, many generative models have been proposed with superior performance. These methods are designed to capture local features from the sequences and then classify them with traditional classifiers such as the Support Vector Machine (SVM). The local features include rotations and translations that represent the geometric relationships of body parts in a Lie group, or the covariance matrix used to learn the co-occurrence of skeleton points. Additionally, Fourier Temporal Pyramids (FTP) or Dynamic Time Warping (DTW) are employed to temporally align the sequences and model temporal dynamics. Furthermore, many methods divide the human body into several parts and learn the co-occurrence information of each part. A Moving Pose descriptor has been proposed to mine key frames temporally via a k-NN approach using both pose and atomic motion features.

Most of the methods mentioned above focus on designing specific hand-crafted features and are thus limited in modeling temporal dynamics. Recently, deep learning methods have been proposed to learn robust feature representations and to model temporal dynamics without segmentation. For instance, a hierarchical RNN has been used to model the temporal dynamics for skeleton-based action recognition. Zhu et al. proposed a deep LSTM network to model the inherent correlations among skeleton joints and the temporal dynamics of various actions. However, few approaches have been proposed for action detection on 3D skeleton data. Li et al. introduced a Joint Classification Regression RNN that avoids the sliding-window design and demonstrates state-of-the-art performance for online action detection. In this work, we propose a large-scale detection benchmark to promote the study of continuous action understanding.

3D Activity Datasets

We have also surveyed dozens of other well-designed action datasets that have greatly advanced the study of 3D action analysis. These datasets have promoted the construction of standardized protocols and the evaluation of different approaches. Furthermore, they often open up new, previously unexplored directions in action recognition and detection. A comparison between several datasets and PKU-MMD is given in Table 1.

The MSR Action3D dataset is one of the earliest datasets for 3D skeleton-based activity analysis. It is composed of instances chosen in the context of interacting with game consoles, such as high arm wave, horizontal arm wave, hammer, and hand catch. The frame rate is 15 FPS and the skeleton data includes the 3D locations of 20 joints.

G3D is designed for real-time action recognition in gaming and contains synchronized videos. As the earliest activity detection dataset, most G3D sequences contain multiple actions recorded in a controlled indoor environment with a fixed camera, a typical setup for gesture-based gaming.

CAD-60 and CAD-120 are two multi-modality datasets. Compared to CAD-60, CAD-120 provides extra labels for temporal locations. Their downside is the limited number of video instances.

ACT4 is a large dataset designed to facilitate practical applications in real life. Its action categories mainly cover activities of daily living. Its drawback is the limited number of modalities.

The Multiview 3D Event and Northwestern-UCLA datasets began to use a multi-view method to capture 3D videos. This method is now widely used in 3D datasets.

Watch-n-Patch and Composable Activities are the first datasets focusing on continuous sequences and the inner composition of activities, with supervised or unsupervised methods. They contain a moderate number of action instances. Moreover, the number of action instances per video is limited, which cannot fulfill the basic requirement of deep network training.

NTU RGB+D is a state-of-the-art large-scale benchmark for action recognition. It establishes a series of standards and provides experience for building large-scale datasets. Recently reported results on this benchmark have reached satisfactory accuracy.

The OAD dataset is a new dataset focusing on online action detection and forecasting. It consists of 59 videos of daily activities captured by Kinect v2.0 devices. This dataset proposes a series of new protocols for 3D action detection and raises the demand for online processing.

However, with the rapid development of action analysis, these datasets can no longer satisfy the demands of data-driven algorithms. Therefore, we collect the PKU-MMD dataset to overcome their drawbacks from the perspectives listed in Table 2.

The Dataset

PKU-MMD is our new large-scale dataset focusing on action detection in long continuous sequences and on multi-modality action analysis. The dataset is captured with the Kinect v2 sensor, which collects color images, depth images, infrared sequences, and human skeleton joints synchronously. We collect 1000+ long action sequences, each of which lasts about 3 $\sim$ 4 minutes (recorded at 30 FPS) and contains approximately 20 action instances. In total, the dataset comprises 5,312,580 frames over 3,000 minutes, with 20,000+ temporally localized actions.

We choose 51 action classes in total, divided into two parts: 41 daily actions (drinking, waving hand, putting on glasses, etc.) and 10 interaction actions (hugging, shaking hands, etc.).

We invite 66 distinct subjects for our data collection. Each subject takes part in 4 daily action videos and 2 interaction action videos. The subjects are between 18 and 40 years old. Following prior work, we also assign each subject a consistent ID number across the entire dataset.

To improve the sequential continuity of long action sequences, the daily actions are designed in a weakly connected mode. For example, we design an action sequence of taking off a shirt, taking off a cap, drinking water, and sitting down to describe the scene that occurs after arriving home. Note that each video contains only one part of the actions, either daily actions or interaction actions. We design 54 sequences and divide the subjects into 9 groups, and each group randomly chooses 6 sequences to perform.

For multi-modality research, we provide 5 categories of resources: depth maps, RGB images, skeleton joints, infrared sequences, and RGB videos. Depth maps are sequences of two-dimensional depth values in millimeters. To preserve all the information, we apply lossless compression to each individual frame. The resolution of each depth frame is $512\times 424$. Joint information consists of the 3-dimensional locations of 25 major body joints for the detected and tracked human bodies in the scene. We further provide the confidence of each joint as supplementary data. RGB videos are recorded at a resolution of $1920\times 1080$. Infrared sequences are also collected and stored frame by frame at $512\times 424$.

Developing the Dataset

Building a large-scale dataset for a computer vision task is traditionally difficult. When collecting untrimmed videos for detection, the most time-consuming work is labeling the temporal boundaries. The goal of PKU-MMD is to provide a large-scale continuous multi-modality 3D action dataset whose items each contain a series of compact actions. We therefore combine traditional recording approaches with our proposed validation methods to enhance the robustness of the dataset and improve efficiency.

We now describe the collection and labeling process for the PKU-MMD dataset in full. Following prior work, we first capture long sequences with Kinect v2 sensors according to well-designed standards. Then, we rely on volunteers to localize the occurrences of actions and verify the temporal boundaries. Finally, we design a cross-validation system to evaluate the confidence of the labels and correct them.

Recording Multi-Modality Videos: After designing the action sequences, we carefully choose a daily-life indoor environment for capturing the video samples, taking irrelevant variables into account. Since temperature changes can cause deviations in the infrared sequences, we carefully calculate the distances among the action area, the windows, and the Kinect devices. Windows are covered for illumination consistency. We use three cameras at fixed angles and heights simultaneously to capture three different horizontal views. The action area is $180$ cm long and $120$ cm wide. Each subject performs every action instance in a long sequence toward a randomly chosen camera, and two consecutive actions may be performed toward different cameras. The horizontal angles of the cameras are $-45^{\circ}$, $0^{\circ}$, and $+45^{\circ}$, at a height of $120$ cm. An example of our multi-modality data can be found in Figure 3.

Localizing Temporal Intervals: At this stage, the captured videos are labeled at the frame level. We employ volunteers to review each video and propose temporal boundaries for each action in the long video. To keep annotation quality high, we employ only proficient volunteers who have experience in labeling temporal actions. Furthermore, different people tend to label the temporal boundaries of the same action differently. We therefore divide the actions into several groups, and the actions in each group are labeled by a single person. At the end of this process, we have a set of verified untrimmed videos, each associated with several action intervals and their labels.

Verifying and Enhancing Labels: Unlike the recognition task, which needs only one label per trimmed video clip, the detection task has a much higher probability of errors in temporal boundaries. Moreover, during the labeling process we observe that an expansion of roughly 10 frames around an action interval is acceptable in some instances. To further improve the robustness of our dataset, we propose a labeling correction and confidence evaluation system to verify and enhance the manual labels. First, we design basic checks for each video, such as "Is there any overlap between actions?" and "Is the length of each action reasonable?". Thanks to multi-view capture, we then use a cross-view method to evaluate and verify the labels. This protocol guarantees the consistency of the videos across views.
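The basic checks and the cross-view comparison described above could look roughly like the following Python sketch. The function names, the length bounds `min_len` and `max_len`, and the 10-frame `tolerance` are illustrative assumptions and are not taken from the actual verification system; annotations are assumed to be `(label, start, end)` tuples in frames, listed in the same order for synchronized views.

```python
def check_video_labels(intervals, min_len=10, max_len=900):
    """Return a list of issues for one video's (label, start, end) annotations.

    min_len and max_len are hypothetical bounds (in frames) on a
    "reasonable" action length.
    """
    issues = []
    ordered = sorted(intervals, key=lambda x: x[1])
    # Check 1: is the length of each action reasonable?
    for label, start, end in ordered:
        length = end - start
        if not (min_len <= length <= max_len):
            issues.append(f"{label}: unreasonable length {length}")
    # Check 2: do consecutive actions overlap?
    for (la, _, ea), (lb, sb, _) in zip(ordered, ordered[1:]):
        if sb < ea:
            issues.append(f"{la} overlaps {lb}")
    return issues


def cross_view_issues(view_a, view_b, tolerance=10):
    """Compare the annotations of two synchronized views of one recording."""
    issues = []
    if len(view_a) != len(view_b):
        issues.append("views disagree on the number of actions")
    for (la, sa, ea), (lb, sb, eb) in zip(view_a, view_b):
        if la != lb:
            issues.append(f"label mismatch: {la} vs {lb}")
        elif abs(sa - sb) > tolerance or abs(ea - eb) > tolerance:
            issues.append(f"{la}: boundaries differ by more than {tolerance} frames")
    return issues
```

Videos that produce a non-empty issue list would then be sent back for manual review.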

Evaluation Protocols

To obtain a standard evaluation for the results on this benchmark, we define several criteria for the evaluation of the precision and recall scores in detection tasks. We propose two dataset partition settings with several precision protocols.

This section introduces the basic dataset splitting settings used for evaluation: the cross-view and cross-subject settings.

Cross-View Evaluation: For cross-view evaluation, the video sequences from the middle and right Kinect devices are used as the training set and those from the left device as the testing set. Cross-view evaluation tests robustness to transformations (e.g., translation and rotation). For this evaluation, the training and testing sets have 717 and 359 video samples, respectively.

Cross-Subject Evaluation: In cross-subject evaluation, we split the subjects into training and testing groups consisting of 57 and 9 subjects, respectively. For this evaluation, the training and testing sets have 944 and 132 long video samples, respectively. Cross-subject evaluation tests the ability to handle intra-class variations among different actors.
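As a minimal sketch, the two partition settings can be expressed as simple filters over a list of video records. The record layout (dictionaries with `view` and `subject` keys) and the view names are assumptions made for illustration; the IDs of the 9 test subjects are not listed here, so `test_subjects` must be supplied by the caller.

```python
def cross_view_split(videos, test_view="left"):
    """Train on the other views, test on `test_view`."""
    train = [v for v in videos if v["view"] != test_view]
    test = [v for v in videos if v["view"] == test_view]
    return train, test


def cross_subject_split(videos, test_subjects):
    """Train and test on disjoint groups of subjects."""
    test_subjects = set(test_subjects)
    train = [v for v in videos if v["subject"] not in test_subjects]
    test = [v for v in videos if v["subject"] in test_subjects]
    return train, test


# Toy example: 3 subjects, each recorded from 3 views.
videos = [
    {"id": f"s{s}-{view}", "view": view, "subject": s}
    for s in (1, 2, 3)
    for view in ("left", "middle", "right")
]
cv_train, cv_test = cross_view_split(videos)
cs_train, cs_test = cross_subject_split(videos, test_subjects={3})
```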

Average Precision Protocols

To evaluate the precision of the proposed action intervals with confidences, two tasks must be considered. One is to determine whether a proposed interval is positive, and the other is to evaluate precision and recall. For the first task, the basic criterion is the overlap ratio between the predicted action interval $I$ and the ground truth interval $I^{*}$, compared against a threshold $\theta$. A detected interval is correct when

$$\frac{|I\cap I^{*}|}{|I\cup I^{*}|}>\theta, \tag{1}$$

where $I\cap I^{*}$ denotes the intersection of the predicted and ground truth intervals and $I\cup I^{*}$ denotes their union. Thus, for a given $\theta$, the precision $p(\theta)$ and recall $r(\theta)$ can be calculated.
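The overlap criterion can be illustrated with a short Python sketch. Representing an interval as a half-open `(start, end)` pair of frame indices is an assumption made here for simplicity.

```python
def temporal_iou(pred, gt):
    """Overlap ratio of two (start, end) intervals: intersection over union."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, gt, theta):
    """A detection is counted as correct when its overlap ratio exceeds theta."""
    return temporal_iou(pred, gt) > theta
```

For example, the intervals `(0, 10)` and `(5, 15)` share 5 frames out of a union of 15, so the detection is correct for $\theta=0.3$ but not for $\theta=0.5$.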

F1-Score: With the above criterion for determining a correct detection, the F1-score is defined as

$$F1(\theta)=2\cdot\frac{p(\theta)\times r(\theta)}{p(\theta)+r(\theta)}. \tag{2}$$

The F1-score is a basic evaluation criterion that ignores the confidence of each interval.
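A minimal sketch of computing precision, recall, and the F1-score for one video at a given $\theta$ follows. The greedy matching rule (each ground-truth interval can be matched by at most one detection of the same label, in list order) is an assumption; the exact matching procedure is not specified in the text.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(detections, ground_truth, theta):
    """detections and ground_truth are lists of (label, start, end)."""
    matched = set()  # indices of ground-truth intervals already used
    tp = 0
    for label, start, end in detections:
        for i, (g_label, g_start, g_end) in enumerate(ground_truth):
            if i in matched or g_label != label:
                continue
            if temporal_iou((start, end), (g_start, g_end)) > theta:
                matched.add(i)
                tp += 1
                break
    p = tp / len(detections) if detections else 0.0
    r = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


gt = [("drink", 0, 100), ("wave", 150, 250)]
det = [("drink", 10, 110), ("wave", 300, 400), ("drink", 0, 100)]
p, r, f1 = precision_recall_f1(det, gt, theta=0.5)
```

In this toy example only the first detection is counted: the second does not overlap any ground truth, and the third targets a ground-truth interval that is already matched.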

Interpolated Average Precision (AP): Interpolated average precision is a well-known evaluation score that uses confidence information for ranked retrieval results. As the confidence threshold changes, precision and recall values can be plotted to give a precision-recall curve. The interpolated precision $p_{interp}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r^{\prime}\geq r$:

$$p_{interp}(r,\theta)=\max_{r^{\prime}\geq r}p(r^{\prime},\theta). \tag{3}$$

Note that $r$ is also determined by the overlap threshold $\theta$. The interpolated average precision is calculated as the arithmetic mean of the interpolated precision at each recall level:

$$AP(\theta)=\int_{0}^{1}p_{interp}(r,\theta)\,dr. \tag{4}$$
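The ranking-based computation can be sketched as follows. This is one common way to compute interpolated AP, assuming that each detection has already been marked correct or incorrect at the chosen $\theta$ and that recall levels not reached by any detection contribute zero precision; it is not necessarily the exact evaluation code used for the benchmark.

```python
def interpolated_ap(scored_hits, num_gt):
    """Interpolated average precision at one overlap threshold.

    scored_hits: list of (confidence, hit) pairs, hit being True for a
                 correct detection.
    num_gt:      number of ground-truth action instances.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, hits = [], []
    tp = 0
    for k, (_, hit) in enumerate(ranked, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / k)  # precision of the top-k detections
        hits.append(bool(hit))
    # Interpolation: replace each precision by the best precision found at
    # the same rank or any later rank (that is, at any recall r' >= r).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Every correct detection raises recall by 1 / num_gt, so the mean over
    # recall levels is the sum over correct detections divided by num_gt.
    return sum(p for p, h in zip(precisions, hits) if h) / num_gt
```

For a ranked list with hits at ranks 1 and 3 and three ground-truth instances, the interpolated precisions at the two reached recall levels are 1 and 2/3, and the third recall level is never reached, giving an AP of about 0.56.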

Mean Average Precision (mAP): Given a retrieval set $Q$ divided into several parts, where each part $q_{j}\in Q$ proposes $m_{j}$ action occurrences $\{d_{1},\ldots,d_{m_{j}}\}$ and $r_{jk}$ is the recall of the top-$k$ ranked retrieval results, mAP is formulated as

$$mAP(\theta)=\frac{1}{|Q|}\sum_{j=1}^{|Q|}\frac{1}{m_{j}}\sum_{k=1}^{m_{j}}p_{interp}(r_{jk},\theta). \tag{5}$$

Note that, because the retrieval set $Q$ is divided into several parts, the AP score (4) is formulated discretely.

We design two splitting protocols: mean average precision of different actions (mAPa) and mean average precision of different videos (mAPv).
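The two splitting protocols differ only in how detections are grouped before averaging, which the following sketch makes explicit. It assumes that $m_{j}$ counts the ground-truth occurrences in each part, as in the standard information-retrieval definition of mAP; the record layout and the function names are hypothetical.

```python
from collections import defaultdict


def average_precision(scored_hits, num_gt):
    """Interpolated AP of one part; scored_hits holds (confidence, hit) pairs."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, hits = [], []
    tp = 0
    for k, (_, hit) in enumerate(ranked, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / k)
        hits.append(bool(hit))
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum(p for p, h in zip(precisions, hits) if h) / num_gt


def mean_ap(records, gt_counts, key):
    """Mean of per-part APs.

    records:   detections as dicts with "video", "action", "confidence", "hit".
    gt_counts: number of ground-truth instances for every part.
    key:       "action" gives mAPa, "video" gives mAPv.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append((rec["confidence"], rec["hit"]))
    aps = [average_precision(groups[part], n) for part, n in gt_counts.items()]
    return sum(aps) / len(aps) if aps else 0.0


records = [
    {"video": "v1", "action": "drink", "confidence": 0.9, "hit": True},
    {"video": "v1", "action": "wave", "confidence": 0.8, "hit": False},
    {"video": "v2", "action": "drink", "confidence": 0.7, "hit": True},
    {"video": "v2", "action": "wave", "confidence": 0.6, "hit": True},
]
map_a = mean_ap(records, {"drink": 2, "wave": 2}, key="action")
map_v = mean_ap(records, {"v1": 2, "v2": 2}, key="video")
```

The same four detections yield different scores under the two groupings, which is why both mAPa and mAPv are reported.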

2D Interpolated Average Precision: Although several protocols have been designed for information retrieval, none of them takes the overlap ratio into consideration. Each AP and mAP score is tied to a particular $\theta$. To evaluate precision across different overlap ratios, we propose the 2D-AP score, which takes both the retrieval result and the overlap ratio of each detection into consideration:

$$\text{2D-AP}=\iint_{r\in[0,1],\,\theta\in[0,1]}p_{interp}(r,\theta)\,dr\,d\theta. \tag{6}$$
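One plausible reading of the 2D-AP score is the area under the AP-versus-$\theta$ curve. The sketch below approximates that area with a midpoint rule over $\theta\in[0,1]$. It is an interpretation under stated assumptions, not the reference implementation: each detection is assumed to carry its overlap ratio with its matched ground-truth interval, each ground-truth interval is assumed to be matched by at most one detection, and `steps` is an arbitrary resolution.

```python
def average_precision(scored_hits, num_gt):
    """Interpolated AP; scored_hits holds (confidence, hit) pairs."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, hits = [], []
    tp = 0
    for k, (_, hit) in enumerate(ranked, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / k)
        hits.append(bool(hit))
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum(p for p, h in zip(precisions, hits) if h) / num_gt


def two_d_ap(scored_ious, num_gt, steps=100):
    """Average AP(theta) over `steps` evenly spaced thresholds in (0, 1).

    scored_ious: list of (confidence, iou) pairs, one per detection.
    """
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps  # midpoint of the i-th sub-interval
        hits = [(conf, iou > theta) for conf, iou in scored_ious]
        total += average_precision(hits, num_gt)
    return total / steps
```

A single detection with overlap ratio 0.5 on a video with one ground-truth action is correct for every $\theta<0.5$ and wrong otherwise, so its score under this reading is 0.5.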

Experiments

This section presents a series of evaluations of basic detection algorithms on our benchmark. Because few implementations exist for 3D action detection, these evaluations also serve to illustrate how challenging activity detection is and to call for new explorations.

In this part, we implement several detection approaches as benchmarks for comparison on the PKU-MMD dataset. Because few approaches exist for the detection task, our baseline methods are divided into two phases: video representation, and temporal localization with category classification.

In order to capture visual patterns in multi-modality input, we construct a series of video representations.

Raw Skeleton (RS): The raw skeleton can be used directly as a representation because it contains high-order location context.

Convolution Skeleton (CS): Convolution skeleton is a new skeleton representation that adds the temporal difference to the raw skeleton after skeleton normalization. This method is described in prior work and has proven to be simple but effective.

Deep RGB (DR): For RGB-based action recognition, traditional motion features such as HOG, HOF, and MBH encoded by Fisher Vectors have proven effective. However, Temporal Segment Networks (TSN) have greatly improved the accuracy on several RGB-based datasets, e.g., UCF101 and ActivityNet. In practice, we adopt features from the convolutional networks of a TSN trained for action recognition as a robust RGB-based deep feature.

Deep Optical Flow (DOF): Optical flow is widely used in event detection, as it provides a representation of motion dynamics. We fine-tune a deep BN-Inception network to learn high-order features of temporal and spatial dynamics. This is motivated by the versatility and robustness of optical-flow-based deep features, which are favored in many recognition studies.
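To make the Convolution Skeleton idea more concrete, the sketch below builds a per-frame feature from normalized joint positions plus their frame-to-frame difference. The normalization used here (subtracting a root joint, taken to be index 0) and the feature layout are assumptions chosen for illustration; the original method may normalize and arrange the data differently.

```python
def convolution_skeleton(frames, root=0):
    """Build per-frame features: normalized joints plus temporal difference.

    frames: list of frames, each a list of (x, y, z) joint positions.
    root:   index of the joint used as the origin (hypothetical choice).
    """
    # Normalize every frame by moving the root joint to the origin.
    normalized = []
    for joints in frames:
        rx, ry, rz = joints[root]
        normalized.append([(x - rx, y - ry, z - rz) for x, y, z in joints])

    features = []
    for t, joints in enumerate(normalized):
        # The first frame has no predecessor, so its difference is zero.
        prev = normalized[t - 1] if t > 0 else joints
        diff = [
            (x - px, y - py, z - pz)
            for (x, y, z), (px, py, pz) in zip(joints, prev)
        ]
        flat = [v for j in joints for v in j] + [v for d in diff for v in d]
        features.append(flat)
    return features


frames = [
    [(1, 1, 1), (2, 2, 2)],
    [(1, 1, 1), (3, 2, 2)],
]
features = convolution_skeleton(frames)
```

With 25 joints per body, each frame would yield a 150-dimensional vector under this layout (75 position values and 75 difference values).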

Temporal Detection Method

Here we introduce several approaches for action detection.

Sliding Window + BLSTM/SVM: Drawing on insights from RGB-based activity detection approaches, we design several sliding-window detection approaches. As classifiers, we choose a network of three stacked bidirectional LSTM (BLSTM) layers and an SVM, motivated by the effectiveness of LSTM models and the agility of SVM classifiers.

Sliding Window + STA-LSTM: The spatial-temporal attention network is a state-of-the-art method proposed for action recognition with a unidirectional LSTM. It uses a regularized cross-entropy loss to drive the learning process, automatically mining discriminative joints while explicitly learning and allocating content-dependent attention to the output of each frame to boost recognition performance.

Joint Classification Regression RNN (JCRRNN): Besides proposing the online action detection task, Li et al. proposed a Joint Classification Regression RNN that performs frame-level real-time action detection.

PKU-MMD Detection Benchmarks

In the detection task, the goal is to find and recognize all activity instances in an untrimmed video. Detection algorithms should provide the start and end points together with the action labels. We use the location annotations of PKU-MMD to compare the performance of the above methods.

As the skeleton is an effective representation, we conduct several experiments to evaluate the ability to model dynamics and localize activity boundaries. Table 4 compares different combinations of skeleton representations and temporal modeling methods. Deep Optical Flow beats the other traditional features owing to its higher accuracy in describing motion. STA-LSTM performs worse than BLSTM, mainly because of the large gap in the number of parameters. The Joint Classification Regression RNN achieves remarkable results because it uses frame-level predictions and is thus more compatible with stricter localization requirements.

We further analyze the performance differences among several sliding-window settings. Figure 2 shows the precision-recall curves of the RS + BLSTM method. The performance is influenced by the window size and stride. When the stride is fixed, smaller windows contain less context information, while larger windows can introduce noise. However, a smaller window size always leads to higher computational complexity. A smaller stride clearly achieves better results thanks to dense sampling, at the cost of more time. In the following experiments, we set both the window size and the stride to 30 as a trade-off between performance and speed.
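A sliding-window detector of this kind can be sketched as follows, using a window size and stride of 30 frames. The `classify` callback stands in for any window classifier (such as the BLSTM or SVM above) and is hypothetical, as is the rule of merging adjacent windows with the same label into one interval and averaging their confidences.

```python
def sliding_windows(num_frames, size=30, stride=30):
    """Return the (start, end) frame ranges of all full-length windows."""
    return [(s, s + size) for s in range(0, num_frames - size + 1, stride)]


def detect_actions(num_frames, classify, size=30, stride=30, background=None):
    """Classify each window and merge touching windows with the same label.

    classify(start, end) must return (label, confidence); windows labeled
    `background` produce no detection.
    """
    detections = []
    current = None  # [label, start, end, confidences] of the open interval

    def flush():
        if current is not None and current[0] != background:
            label, start, end, confs = current
            detections.append((label, start, end, sum(confs) / len(confs)))

    for start, end in sliding_windows(num_frames, size, stride):
        label, conf = classify(start, end)
        if current is not None and current[0] == label and start <= current[2]:
            current[2] = end  # extend the open interval
            current[3].append(conf)
        else:
            flush()
            current = [label, start, end, [conf]]
    flush()
    return detections


def fake_classifier(start, end):
    """Toy stand-in for a trained window classifier."""
    if end <= 60:
        return ("drink", 0.8)
    if start >= 90:
        return ("wave", 0.6)
    return (None, 1.0)


result = detect_actions(120, fake_classifier)
```

With a stride equal to the window size the windows do not overlap, so every frame is classified once; a smaller stride samples more densely but requires proportionally more classifier calls.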

Multi-Modality Scenarios

In this task, we evaluate the capability of detecting activity in multi-modality scenarios. Together with the raw skeleton data, we calculate the first-order difference of the sequential skeleton input. Moreover, two deep features are extracted from fine-tuned deep convolutional networks: the Deep RGB feature and the Deep Optical Flow feature. We first evaluate the independent performance of the different feature representations on long video sequences by classifying 10-frame sliding windows with an SVM. Both cross-view and cross-subject evaluations are carried out. We then examine the combination of different modalities; the results are shown in Table 3.

Conclusion

In this paper, we propose a large-scale multi-modality 3D dataset (PKU-MMD) for human activity understanding, especially for action detection, which demands localizing temporal boundaries and recognizing activity categories. Performed by 66 actors, our dataset includes 1076 long video sequences, each of which contains approximately 20 action instances from 51 action classes. Compared with current 3D datasets for temporal detection, our dataset is much larger (3000 minutes and 5.4 million frames in total) and contains much more variety (3 views, 66 subjects). The multi-modality nature and larger scale of the collected data enable further experiments on deep networks such as LSTMs or CNNs. Building on several detection retrieval protocols, we design a new 2D-AP evaluation for the action detection task that takes both overlap and detection confidence into consideration. We also design a range of experiments to evaluate several detection methods on the PKU-MMD benchmarks. The results show that the performance of existing methods is not yet satisfactory. Thus, large-scale 3D action detection is far from solved, and we hope this dataset will draw more studies on action detection methodologies and advance action detection technology.