Co-occurrence Feature Learning for Skeleton based Action Recognition using Regularized Deep LSTM Networks
Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng, Yanghao Li, Li Shen, Xiaohui Xie
Introduction
Recognizing human actions has remained one of the most important and challenging tasks in computer vision. It facilitates a wide range of applications such as intelligent video surveillance, human-computer interaction, and video understanding (?; ?).
Traditional studies on action recognition mainly focus on recognizing actions from RGB videos recorded by 2D cameras (?). However, capturing human actions in the full 3D space in which they actually occur can provide more comprehensive information. Biological observations suggest that humans can recognize actions from just the motion of a few light displays attached to the human body (?). Motion capture systems (?) extract 3D joint positions using markers and high precision camera arrays. Although slightly higher in price, such systems provide highly accurate joint positions for skeletons. Recently, the Kinect device has gained much popularity thanks to its excellent accuracy in human body modeling and affordable price. The bundled SDK for Kinect v2 can directly generate accurate skeletons in real time. Due to the prevalence of these devices, skeleton based representations of the human body and its temporal evolution have become an attractive option for action recognition.
In this paper, we focus on the problem of skeleton based action recognition. The key to this problem lies mainly in two aspects. One is to design robust and discriminative features from the skeleton (and the corresponding RGBD images) for intra-frame content representation (?; ?; ?; ?; ?). The other is to explore temporal dependencies of the inter-frame content for action dynamics modeling, using a hierarchical maximum entropy Markov model (?), a hidden Markov model (?) or conditional random fields (?). Inspired by the success of deep recurrent neural networks (RNNs) using the Long Short-Term Memory (LSTM) architecture for speech feature learning and time series modeling (?; ?), we intend to build an effective action recognition model based on a deep LSTM network.
To this end, we propose an end-to-end fully connected deep LSTM network to perform automatic feature learning and motion modeling (Fig. 1). The proposed network is constructed by inheriting many insights from recent successful networks (?; ?; ?; ?) and is designed to robustly model complex relationships among different joints. The LSTM layers and feedforward layers are alternately deployed to construct a deep network to capture the motion information. To ensure the model learns effective features and motion dynamics, we enforce different types of strong regularization in different parts of the model, which effectively mitigates over-fitting.
Specifically, two types of regularizations are proposed. (i) For the fully connected layers, we introduce regularization to drive the model to learn co-occurrence features of the joints at different layers. (ii) For the LSTM neurons, we derive a new dropout and apply it to the LSTM neurons in the last LSTM layer, which helps the network to learn complex motion dynamics. With these forms of regularization, we validate our deep LSTM networks on three public datasets for human action recognition. The proposed model has been shown to consistently outperform other state-of-the-art algorithms for skeleton based human action recognition.
Related Work
In contrast to handcrafted features, there is a growing trend of learning robust feature representations from raw data with deep neural networks, and excellent performance has been reported in image classification (?) and speech recognition (?). However, there are only a few works which leverage neural networks for skeleton based action recognition. A multi-layer perceptron network is trained to classify each frame (?); however, such a network cannot explore temporal dependencies very well. In contrast, a gesture recognition system (?) employs a shallow bidirectional LSTM with only one forward hidden layer and one backward hidden layer to explore long-range temporal dependencies. A deep recurrent neural network architecture with handcrafted subnets is utilized for skeleton based action recognition (?). However, the handcrafted hierarchical subnets and their fusion ignore the inherent co-occurrences of joints. This motivates us to design a deep fully connected neural network which is capable of fully exploiting the inherent correlations among skeleton joints in various actions.
2 Co-occurrence Exploration
An action is usually associated with and characterized by the interactions and combinations of only a subset of the skeleton joints. For example, the joints “hand”, “arm” and “head” are associated with the action “making telephone call”. An actionlet ensemble model exploits this trait by mining some particular conjunctions of the features corresponding to some subsets of the joints (?). Similarly, actions involving two people can be characterized by the interactions of a subset of the two persons’ joints (?; ?). Inspired by the actionlet ensemble model, we introduce a new exploration mechanism in the deep LSTM architecture to achieve automatic co-occurrence mining, as opposed to specifying in advance which joints should be grouped.
3 Dropout for Recurrent Neural Networks
Dropout has been demonstrated to be quite effective in deep convolutional neural networks (?), but there has been relatively little research on applying it to RNNs. In order to preserve the ability of RNNs to model sequences, it has been proposed to apply dropout only to the feedforward (along layers) connections and not to the recurrent (along time) connections (?). This avoids erasing all the information held by the units through dropout. Note that the previous work only considers dropout at the output response of an LSTM neuron (?). However, considering that an LSTM neuron consists of internal cell and gate units, we believe one should not only look at the output of the neuron but also into its internal structure to design effective dropout schemes. In this paper, we design an in-depth dropout for LSTM to address this problem.
Deep LSTM with Co-occurrence Exploration and In-depth Dropout
Leveraging the insights from recent successful networks, we design a fully connected deep LSTM network for skeleton based action recognition. Fig. 1 shows the architecture of the proposed network, which has three bidirectional LSTM layers, two feedforward layers, and a softmax layer that gives the predictions. The proposed full connection architecture enables one to fully exploit the inherent correlations among skeleton joints. In the network, the co-occurrence exploration is applied to the connections prior to the second LSTM layer to learn the co-occurrences of joints/features. LSTM dropout is applied to the last LSTM layer to enable more effective learning. Note that each LSTM layer uses bidirectional LSTM and we do not explicitly distinguish the forward and backward LSTM neurons in Fig. 1. At each time step, the input to the network is a vector denoting the 3D positions of the skeleton joints in a frame.
In the following, we first review LSTM briefly to make the paper self-contained. Then we introduce our method for co-occurrence exploration in the deep LSTM network. Lastly we describe our dropout algorithm which is designed for the LSTM neurons and enables effective learning of the model.
The RNN is a successful model for sequential learning (?). For the recurrent neurons at some layer, the output responses $h_t$ are calculated based on the inputs $x_t$ to this layer and the responses $h_{t-1}$ from the previous time slot
$$h_t = \theta\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right), \tag{1}$$
where $\theta(\cdot)$ denotes the activation function, $b_h$ denotes the bias vector, $W_{xh}$ is the matrix of weights between the input and hidden layer and $W_{hh}$ is the matrix of recurrent weights from the hidden layer to itself at adjacent time steps, which is used for exploring temporal dependency.
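The recurrence described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `rnn_forward`, the zero initial state, the choice of `tanh` as activation and the toy sizes are all assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h, activation=np.tanh):
    """Apply h_t = activation(W_xh x_t + W_hh h_{t-1} + b_h) along a sequence."""
    h = np.zeros(W_hh.shape[0])          # assumed zero initial state
    outputs = []
    for x_t in x_seq:                    # one skeleton frame per time step
        h = activation(W_xh @ x_t + W_hh @ h + b_h)
        outputs.append(h)
    return np.stack(outputs)             # shape (T, H)

# Toy sizes (hypothetical): 6 frames, 15 joints x 3 coordinates, 8 neurons.
rng = np.random.default_rng(0)
T, D, H = 6, 45, 8
x_seq = rng.standard_normal((T, D))
W_xh = 0.1 * rng.standard_normal((H, D))
W_hh = 0.1 * rng.standard_normal((H, H))
b_h = np.zeros(H)
h_seq = rnn_forward(x_seq, W_xh, W_hh, b_h)
```

The recurrent matrix is the only path by which earlier frames influence later responses, which is why repeated multiplication by it makes long-range gradients hard to preserve.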
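LSTM is an advanced RNN architecture which can learn long-range dependencies (?). Fig. 2 shows a typical LSTM neuron, which contains an input gate $i_t$, a forget gate $f_t$, a cell $c_t$, an output gate $o_t$ and an output response $h_t$. The input gate and forget gate govern the information flow into and out of the cell. The output gate controls how much information from the cell is passed to the output $h_t$. The memory cell has a self-connected recurrent edge of weight 1, ensuring that the gradient can pass across many time steps without vanishing or exploding. Therefore, it overcomes the difficulties in training the RNN model caused by the “vanishing gradient” effect. For all the LSTM neurons in some layer, at time $t$, the recursive computation of activations of the units is
$$\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right), \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right), \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o\right), \\
h_t &= o_t \odot \tanh\left(c_t\right),
\end{aligned} \tag{2}$$
where $\odot$ denotes element-wise product, $\sigma(x)$ is the sigmoid function defined as $\sigma(x) = 1/(1 + e^{-x})$, $W_{\alpha\beta}$ is the weight matrix between $\alpha$ and $\beta$ (e.g., $W_{xi}$ is the weight matrix from the inputs $x_t$ to the input gates $i_t$), and $b_\beta$ denotes the bias term of $\beta$ with $\beta \in \{i, f, c, o\}$. Four weight matrices are associated with input $x_t$. To allow the information from both the future and the past to determine the output, bidirectional LSTM can be utilized (?).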
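A minimal NumPy sketch of one LSTM layer run in both directions may make the gate computations concrete. The helper names (`init_params`, `lstm_step`, `run_lstm`), the parameter dictionary layout and the toy sizes are hypothetical. The cell-to-gate (peephole) terms are stored as element-wise vectors, which is an assumption about the formulation rather than something the text specifies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(D, H, rng):
    """Hypothetical parameter container; keys mirror the weight subscripts."""
    p = {}
    for g in "ifco":                          # four matrices act on the input x_t
        p["W_x" + g] = 0.1 * rng.standard_normal((H, D))
        p["W_h" + g] = 0.1 * rng.standard_normal((H, H))
        p["b_" + g] = np.zeros(H)
    for g in "ifo":                           # assumed diagonal peephole weights
        p["w_c" + g] = 0.1 * rng.standard_normal(H)
    return p

def lstm_step(x_t, h_prev, c_prev, p):
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def run_lstm(x_seq, p, H):
    h, c = np.zeros(H), np.zeros(H)           # assumed zero initial state
    out = []
    for x_t in x_seq:
        h, c = lstm_step(x_t, h, c, p)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T, D, H = 5, 45, 4                             # toy sizes
x_seq = rng.standard_normal((T, D))
fwd = run_lstm(x_seq, init_params(D, H, rng), H)
# Backward direction: process the reversed sequence, then restore time order.
bwd = run_lstm(x_seq[::-1], init_params(D, H, rng), H)[::-1]
h_bi = np.concatenate([fwd, bwd], axis=1)      # shape (T, 2H)
```

Concatenating the two directions doubles the width of the layer output, so each frame's response depends on both earlier and later frames.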
2 Co-occurrence Exploration
The fully connected deep LSTM network in Fig. 1 has very powerful learning capability. However, it is difficult to learn directly due to the huge parameter space. To overcome this problem, we introduce a co-occurrence exploration process to ensure the deep model learns effective features.
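The co-occurrence of some joints can intrinsically characterize a human action. Fig. 3 shows two examples. For “walking”, the joints from hands and feet have high correlations but they all have low correlations with the joint of root. The sets of correlated joints for “walking” and “drinking” are very different, indicating that the discriminative subset of joints varies for different types of actions. Two main aspects have been considered in our design of the network and the specialized regularization we propose. (i) We expect the network to automatically explore the conjunctions of discriminative joints. (ii) We expect the network to explore different co-occurrences for different types of actions. Therefore, we design the fully connected network to allow each neuron to be connected to any joints (for the first layer) or responses of the previous layer (for the second or higher layer) to automatically explore the co-occurrences. Note that the output responses are also referred to as features, which are the input of the next layer. We divide the neurons in the same layer into $K$ groups to allow different groups to focus on exploration of different conjunctions of discriminative joints. Taking the $k$-th group of neurons as an example (see Fig. 4 (a)), the neurons will automatically connect the discriminative joints/features. In our design, we incorporate the co-occurrence regularization into the loss function
$$\min_{W_{x\beta}} \; \mathcal{L} + \lambda_1 \sum_{\beta \in S} \left\| W_{x\beta} \right\|_1 + \lambda_2 \sum_{\beta \in S} \sum_{k=1}^{K} \left\| W_{x\beta,k}^{T} \right\|_{2,1}, \tag{3}$$
where $\mathcal{L}$ is the loss function of the deep LSTM network, $W_{x\beta}$ is the weight matrix from the inputs to the units $\beta$, $S$ is the set of internal units of the neurons, $W_{x\beta,k}$ is the sub-matrix of $W_{x\beta}$ belonging to the $k$-th group, and $\lambda_1$ and $\lambda_2$ weight the two regularization terms. The $\ell_1$ term promotes sparse connections, while the $\ell_{2,1}$ term, which sums over the inputs the $\ell_2$ norm of each input's weights into group $k$, encourages the neurons of a group to select the same inputs.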
The stochastic gradient descent method is then employed to solve (3). The advantage of the co-occurrence learning is that the model can automatically learn the discriminative joint/feature connections, avoiding the fixed a priori blocking of joint co-occurrences across human parts (?) as illustrated in Fig. 4 (b).
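The effect of a grouped regularizer of this kind can be illustrated with a small NumPy sketch. It assumes the penalty combines an $\ell_1$ term with a per-group $\ell_{2,1}$ term computed over the columns (inputs) of each group's weight sub-matrix; the function name `cooccurrence_penalty`, the use of `np.array_split` to form contiguous groups and the toy weights are all hypothetical choices for the example.

```python
import numpy as np

def cooccurrence_penalty(W, num_groups, lam1, lam2):
    """Penalty for one input weight matrix W of shape (neurons, inputs).

    Assumed form: lam1 * ||W||_1 plus lam2 times, for each group of neurons,
    the sum over inputs of the L2 norm of that input's weights into the group.
    """
    l1_term = np.abs(W).sum()
    group_term = 0.0
    for W_k in np.array_split(W, num_groups, axis=0):   # contiguous neuron groups
        group_term += np.sqrt((W_k ** 2).sum(axis=0)).sum()
    return lam1 * l1_term + lam2 * group_term

# Two neurons in one group, three inputs. Both matrices have the same L1 norm.
W_shared = np.array([[3.0, 0.0, 0.0],
                     [4.0, 0.0, 0.0]])     # both neurons use input 0
W_scattered = np.array([[3.0, 0.0, 0.0],
                        [0.0, 4.0, 0.0]])  # neurons use different inputs

shared_cost = cooccurrence_penalty(W_shared, num_groups=1, lam1=0.0, lam2=1.0)
scattered_cost = cooccurrence_penalty(W_scattered, num_groups=1, lam1=0.0, lam2=1.0)
```

With these toy weights the shared pattern costs 5 and the scattered pattern costs 7, so neurons in the same group are pushed toward the same subset of inputs, while different groups remain free to pick different subsets.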
3 In-depth Dropout for LSTM
Dropout tries to combine the predictions of many “thinned” networks to boost the performance. During training, the network randomly drops some neurons to force the remaining sub-network to compensate. During testing, the network uses all the neurons together to make predictions.
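To extend this idea to LSTM, we propose a new dropout algorithm to allow the dropping of the internal gates, cell and output response for an LSTM neuron, encouraging each unit to learn better parameters. For clarity, an LSTM neuron is shown in Fig. 5 (a) in the unfolded form, where the units are explicitly connected. For recurrent neural networks, the erasing of all the information from a unit is not desirable, especially when the unit remembers events that occurred many time steps back in the past (?). Therefore, we allow the influence of dropout in LSTM to flow along layers (marked by dashed arrows) but prohibit it from flowing along the time axis (marked by solid arrows), as illustrated in Fig. 5 (b). To control the influence flows, in the feedforward process, the network calculates and records two types of activations as follows. The responses of units to be transmitted along the time without dropout are
$$\begin{aligned}
\tilde{i}_t &= \sigma\left(W_{xi} x_t + W_{hi} \tilde{h}_{t-1} + W_{ci} \tilde{c}_{t-1} + b_i\right), \\
\tilde{f}_t &= \sigma\left(W_{xf} x_t + W_{hf} \tilde{h}_{t-1} + W_{cf} \tilde{c}_{t-1} + b_f\right), \\
\tilde{c}_t &= \tilde{f}_t \odot \tilde{c}_{t-1} + \tilde{i}_t \odot \tanh\left(W_{xc} x_t + W_{hc} \tilde{h}_{t-1} + b_c\right), \\
\tilde{o}_t &= \sigma\left(W_{xo} x_t + W_{ho} \tilde{h}_{t-1} + W_{co} \tilde{c}_t + b_o\right), \\
\tilde{h}_t &= \tilde{o}_t \odot \tanh\left(\tilde{c}_t\right).
\end{aligned} \tag{4}$$
The responses of units to be transmitted across layers with dropout applied are
$$\begin{aligned}
\check{i}_t &= \sigma\left(W_{xi} x_t + W_{hi} \tilde{h}_{t-1} + W_{ci} \tilde{c}_{t-1} + b_i\right) \odot m_i, \\
\check{f}_t &= \sigma\left(W_{xf} x_t + W_{hf} \tilde{h}_{t-1} + W_{cf} \tilde{c}_{t-1} + b_f\right) \odot m_f, \\
\check{c}_t &= \left(\check{f}_t \odot \tilde{c}_{t-1} + \check{i}_t \odot \tanh\left(W_{xc} x_t + W_{hc} \tilde{h}_{t-1} + b_c\right)\right) \odot m_c, \\
\check{o}_t &= \sigma\left(W_{xo} x_t + W_{ho} \tilde{h}_{t-1} + W_{co} \check{c}_t + b_o\right) \odot m_o, \\
\check{h}_t &= \check{o}_t \odot \tanh\left(\check{c}_t\right) \odot m_h,
\end{aligned} \tag{5}$$
where $m_i$, $m_f$, $m_c$, $m_o$ and $m_h$ are dropout binary mask vectors for input gates, forget gates, cells, output gates and output responses, respectively, with an element value of 0 indicating that dropout happens. Note that for the first LSTM layer, the inputs $x_t$ are the skeleton joints of a frame; for the higher LSTM layers, the inputs $x_t$ are the response outputs of the previous layer.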
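The two sets of activations can be sketched as a single forward step that returns both the no-dropout state (passed along time) and the masked response (passed to the next layer). This is a sketch under stated assumptions, not the authors' implementation: the helper names, the dictionary layout, the element-wise peephole vectors and the toy sizes are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(D, H, rng):
    """Hypothetical parameter container; keys mirror the weight subscripts."""
    p = {}
    for g in "ifco":
        p["W_x" + g] = 0.1 * rng.standard_normal((H, D))
        p["W_h" + g] = 0.1 * rng.standard_normal((H, H))
        p["b_" + g] = np.zeros(H)
    for g in "ifo":
        p["w_c" + g] = 0.1 * rng.standard_normal(H)   # assumed diagonal peepholes
    return p

def sample_masks(H, p_drop, rng):
    """One binary mask per unit type (i, f, c, o, h); 0 means dropped."""
    return {k: (rng.random(H) >= p_drop).astype(float) for k in "ifcoh"}

def indepth_dropout_step(x_t, h_prev, c_prev, p, m):
    """h_prev and c_prev are the no-dropout states from the previous time step."""
    a_i = p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"]
    a_f = p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"]
    g = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    a_o = p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"]

    # Path along time: no dropout, so memory is never erased.
    i_tl, f_tl = sigmoid(a_i), sigmoid(a_f)
    c_tl = f_tl * c_prev + i_tl * g
    h_tl = sigmoid(a_o + p["w_co"] * c_tl) * np.tanh(c_tl)

    # Path across layers: every internal unit is masked.
    i_ck, f_ck = sigmoid(a_i) * m["i"], sigmoid(a_f) * m["f"]
    c_ck = (f_ck * c_prev + i_ck * g) * m["c"]
    o_ck = sigmoid(a_o + p["w_co"] * c_ck) * m["o"]
    h_ck = o_ck * np.tanh(c_ck) * m["h"]
    return h_tl, c_tl, h_ck

rng = np.random.default_rng(0)
D, H = 45, 6                                   # toy sizes
p = init_params(D, H, rng)
x_t = rng.standard_normal(D)
h0, c0 = 0.1 * rng.standard_normal(H), 0.1 * rng.standard_normal(H)
masks = sample_masks(H, 0.2, rng)
h_time, c_time, h_layer = indepth_dropout_step(x_t, h0, c0, p, masks)
```

Only `h_time` and `c_time` are fed back at the next time step, so a dropped unit affects the layer above at the current frame without disturbing what the neuron remembers.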
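During the training process, the errors back-propagated to the output responses are
$$\epsilon_h^t = \check{\epsilon}_h^t + \tilde{\epsilon}_h^t,$$
where $\check{\epsilon}_h^t$ denotes the errors arriving from the upper layer at the same time step, through the dropped responses $\check{h}_t$ of (5), and $\tilde{\epsilon}_h^t$ denotes the errors arriving from the next time step in the same layer, through the responses $\tilde{h}_t$ of (4).
By taking the derivative of $\check{h}_t$ with respect to $\check{o}_t$ based on (5), we get the errors from $\check{h}_t$ to $\check{o}_t$, which represent the errors from the upper layer with dropout involved
$$\check{\epsilon}_o^t = \left(\check{\epsilon}_h^t \odot \frac{\partial \check{h}_t}{\partial \check{o}_t}\right) \odot m_o = \check{\epsilon}_h^t \odot \tanh\left(\check{c}_t\right) \odot m_h \odot m_o,$$
where $m_h$ and $m_o$ are the dropout masks of the output responses and output gates, and the factor $m_o$ appears because $\check{o}_t$ is itself masked in (5). Similarly, based on (4), the errors from $\tilde{h}_t$ to $\tilde{o}_t$ are $\tilde{\epsilon}_o^t = \tilde{\epsilon}_h^t \odot \tanh\left(\tilde{c}_t\right)$.
Then, the errors back-propagated to the output gates are the summation of the two types of errors
$$\epsilon_o^t = \check{\epsilon}_o^t + \tilde{\epsilon}_o^t.$$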
In the same way, we derive the errors propagated to the cells, forget gates, and input gates, based on (4) and (5).
During the testing process, we use all the neurons but multiply the units of the LSTM neurons (in the LSTM layer where dropout is applied) by $1 - p$, where $p$ is the dropout probability of that unit. Note that the simple dropout which only drops the output responses (?) is a special case of our proposed dropout.
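The test-time scaling can be checked numerically. The sketch below assumes the usual dropout convention that a unit with dropout probability `p_drop` is multiplied by its keep probability `1 - p_drop` at test time; the activation values and sample count are made up for the illustration.

```python
import numpy as np

p_drop = 0.2
rng = np.random.default_rng(0)

# Hypothetical activations of three units before masking.
unit = np.array([0.5, -1.0, 2.0])

# Training: each unit is kept with probability 1 - p_drop in every "thinned" network.
masks = (rng.random((100_000, unit.size)) >= p_drop).astype(float)
train_average = (masks * unit).mean(axis=0)

# Testing: all units are used, scaled by the keep probability.
test_value = (1.0 - p_drop) * unit
```

The average over many random masks matches the scaled value, which is why a single scaled pass is used in place of averaging many thinned networks.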
4 Action Recognition using the Learned Model
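With the learned deep LSTM network, the probability that a sequence $X$ belongs to the class $C_k$ is
$$p(C_k \mid X) = \frac{e^{o_k}}{\sum_{i=1}^{C} e^{o_i}}, \quad k = 1, \ldots, C, \qquad o = \sum_{t=1}^{T} \left( W_{\overrightarrow{h}o} \overrightarrow{h}_t + W_{\overleftarrow{h}o} \overleftarrow{h}_t + b_o \right),$$
where $C$ denotes the number of classes, $T$ represents the length of the test sequence, $o = [o_1, \ldots, o_C]$, and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the output responses of the last bidirectional LSTM layer. Then, the class with the highest probability is chosen as the action class.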
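A minimal sketch of this decision rule, assuming the class scores are accumulated over all frames of the sequence before a softmax is applied. The function name `classify_sequence`, the output-layer weight names and the random toy inputs are hypothetical.

```python
import numpy as np

def classify_sequence(h_fwd, h_bwd, W_fwd, W_bwd, b):
    """h_fwd, h_bwd: (T, H) responses of the last bidirectional LSTM layer.
    W_fwd, W_bwd: (C, H) output weights; b: (C,) output bias."""
    # Accumulate per-frame class scores over the whole sequence.
    o = (h_fwd @ W_fwd.T + h_bwd @ W_bwd.T + b).sum(axis=0)
    o = o - o.max()                       # numerical stability; softmax is unchanged
    probs = np.exp(o) / np.exp(o).sum()
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(0)
T, H, C = 20, 8, 8                        # toy sizes: 20 frames, 8 neurons, 8 classes
h_fwd = rng.uniform(-1, 1, (T, H))
h_bwd = rng.uniform(-1, 1, (T, H))
W_fwd = 0.1 * rng.standard_normal((C, H))
W_bwd = 0.1 * rng.standard_normal((C, H))
b = np.zeros(C)
probs, label = classify_sequence(h_fwd, h_bwd, W_fwd, W_bwd, b)
```

Summing scores over time gives one prediction per sequence regardless of its length, which matters when sequence lengths vary widely.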
Experiments
We validate the proposed model on the SBU Kinect interaction dataset (?), the HDM05 dataset (?), and the CMU dataset (?), whose ground truth we labeled ourselves. We have also tested our model on the Berkeley MHAD action recognition dataset (?) and achieved 100% accuracy. To investigate the impact of each component in our model, we conduct experiments under the following configurations:
Deep LSTM is our basic scheme without regularizations;
Deep LSTM + Co-occurrence is the scheme with our proposed co-occurrence regularization applied;
Deep LSTM + Simple Dropout is the scheme with the dropout algorithm proposed by Zaremba et al. (?) applied to our basic scheme;
Deep LSTM + In-depth Dropout is the scheme with our proposed in-depth dropout applied;
Deep LSTM + Co-occurrence + In-depth Dropout is our final scheme with both co-occurrence regularization and in-depth dropout applied.
The skeleton sequences are temporally down-sampled to a frame rate of 30 fps on the HDM05 dataset and CMU dataset. To reduce the influence of noise in the captured skeleton data, we smooth each joint’s position of the raw skeleton using a filter in the temporal domain (?; ?). The number of groups ($K$) in our model is set to 5, 10, and 10 for the first three layers experimentally. We set the dropout probability $p$ to 0.2 for each unit in an LSTM neuron, which makes the overall dropout probability of an LSTM neuron approach 0.5 (this can be derived based on (5)). Note that when dropout is applied, the number of neurons in the corresponding layer is doubled as suggested by previous work (?). We set the parameters $\lambda_1$ and $\lambda_2$ in (3) experimentally.
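The claim that a per-unit dropout probability of 0.2 yields an overall probability close to 0.5 can be checked with a short calculation. The sketch below offers two possible readings, both assumptions about how the figure is derived: in the first, the masked response is nonzero only if the masks on the cell, output gate and output response all keep their unit; in the second, the cell is additionally treated as zeroed when both the input-gate and forget-gate masks drop. Masks are assumed independent.

```python
import numpy as np

p = 0.2                      # per-unit dropout probability from the text
keep = 1.0 - p

# Reading 1: masks on the cell, output gate and output response must all keep.
drop_three_masks = 1.0 - keep ** 3

# Reading 2: the cell also vanishes when both gate masks drop at once.
drop_with_gates = 1.0 - keep ** 3 * (1.0 - p ** 2)

# Monte Carlo check of reading 2 with independent Bernoulli masks.
rng = np.random.default_rng(0)
m = rng.random((200_000, 5)) >= p                 # columns: i, f, c, o, h
alive = m[:, 2] & m[:, 3] & m[:, 4] & (m[:, 0] | m[:, 1])
drop_simulated = 1.0 - alive.mean()
```

The two readings give roughly 0.488 and 0.508, both of which are consistent with an overall dropout probability approaching 0.5.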
The SBU Kinect interaction dataset is a Kinect-captured human activity recognition dataset depicting two-person interactions, which contains 230 sequences of 8 classes (6,614 frames) and is evaluated with subject-independent 5-fold cross validation. The smoothed positions of joints are used as the input of the deep LSTM network for recognition. The number of neurons is set to 100×2, 100, 110×2, 110, 200×2 for the first to fifth layers respectively, where ×2 indicates that bidirectional LSTM is used and thus the number of neurons is doubled.
We have compared our schemes with other skeleton based methods (?; ?; ?). Note that we add an additional layer to fuse the two subnets corresponding to the two persons when extending Hierarchical RNN scheme for use in the two person interaction scenario (?). We summarize the results in terms of the average recognition accuracy (5-fold cross validation) in Table 1.
Table 1 shows that our basic scheme of Deep LSTM achieves comparable performance to the method using handcrafted complex features (?). The proposed schemes of Deep LSTM + Co-occurrence and Deep LSTM + In-depth Dropout can improve the recognition accuracy by 3.4% and 4.1% respectively over Deep LSTM, indicating that the co-occurrence exploration boosts the discrimination of features and the proposed LSTM dropout is capable of learning a more effective model. Deep LSTM + In-depth Dropout is superior to Deep LSTM + Simple Dropout. Note that the deep LSTM network achieves remarkable (5.6%) performance improvement in comparison with the hierarchical RNN network (?). That is because allowing full connection of joints/features with neurons rather than imposing a priori subnet constraints facilitates the interaction among joints especially when the joints do not belong to the same part, or same person. Our scheme with combined regularizations achieves the best performance.
2 HDM05 Dataset
The HDM05 dataset contains 2,337 skeleton sequences performed by 5 actors (184,046 frames after down-sampling). For fair comparison, we use the same protocol (65 classes, 10-fold cross validation) as used by Cho and Chen (?). The pre-processing is the same as that done in the hierarchical RNN scheme (?) (centralizing the joints’ positions to the human center for each frame and smoothing the positions). The number of neurons is 100, 110, 120, 120, 200 for the five layers respectively. Table 2 shows the results in terms of average accuracy. Our basic deep LSTM achieves better results than the Multi-layer Perceptron model, which suggests that LSTM exhibits better motion modeling ability than the MLP. With the proposed co-occurrence learning and in-depth dropout regularization, our full model also performs better than the manually designed hierarchical RNN approach.
3 CMU Dataset
We have categorized the CMU motion capture dataset into 45 classes for the purpose of skeleton based action recognition (http://www.escience.cn/people/zhuwentao/29634.html). The categorized dataset contains 2,235 sequences (987,341 frames after down-sampling) and is the largest skeleton based human action dataset so far. This dataset is much more challenging because: (i) the lengths of sequences vary greatly; (ii) the within-class diversity is large, e.g., for “walking”, different people walk at different speeds and along different paths; (iii) the dataset contains complex actions such as dancing and doing yoga.
We have evaluated the performance on both the entire dataset (CMU) and a subset of the dataset (CMU subset). For this subset, we have chosen eight representative action categories containing 664 sequences (125,667 frames after down-sampling), with actions of jump, walk back, run, sit, getup, pickup, basketball, cartwheel. The same pre-processing as used for the HDM05 dataset is performed. The number of neurons is set to 100, 100, 120, 120, 100 for the five layers. Three-fold cross validation is conducted and the results in terms of average accuracy are shown in Table 3. We can see that the proposed model achieves significant performance improvement, indicating that it can better learn the discriminative features and model long-range temporal dynamics even for this challenging dataset.
4 Discussions
To further understand our deep LSTM network, we visualize the weights learned in the first LSTM layer on the SBU kinect interaction dataset in Fig. 6 (a). Each element represents the absolute value of the weight between the corresponding skeleton joint and input gate of that LSTM neuron. It is observed that the weights in the diagonal positions marked by the red ellipse have high values, which means the co-occurrence regularization helps learn the human parts automatically. In contrast to the part based subnet fusion model (?), the learned co-occurrences of joints by our model do not limit the connections to be in the same part, as there are many large weights outside the diagonal regions, e.g., in the regions marked by white circles, making the network more powerful for action recognition. This also signifies the importance of the proposed full connection architecture. By averaging the energy of the weights in the same group of neurons for each joint, we obtain Fig. 6 (b) which has five groups of LSTM neurons. It is observed that different groups have different weight patterns, preferring different conjunctions of joints.
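The group-wise averaging used for this kind of visualization can be sketched as follows. The sketch makes several assumptions that the text does not spell out: "energy" is taken as the squared weight summed over the coordinates of a joint, the input dimensions are ordered joint by joint, and groups are contiguous blocks of neurons. The function name `group_joint_energy` and the toy weights are hypothetical.

```python
import numpy as np

def group_joint_energy(W, num_groups, coords_per_joint=3):
    """W: (neurons, joints * coords_per_joint) input-to-gate weights.
    Returns a (num_groups, joints) map of average weight energy."""
    N, D = W.shape
    J = D // coords_per_joint
    # Energy of each neuron-joint connection: sum of squares over x, y, z.
    per_joint = (W.reshape(N, J, coords_per_joint) ** 2).sum(axis=2)   # (N, J)
    groups = np.array_split(per_joint, num_groups, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])                  # (K, J)

# Toy example: 4 neurons in 2 groups, 2 joints with 3 coordinates each.
W = np.zeros((4, 6))
W[0, 0:3] = [1.0, 2.0, 2.0]      # neuron 0 attends to joint 0
W[3, 3:6] = [0.0, 3.0, 4.0]      # neuron 3 attends to joint 1
energy = group_joint_energy(W, num_groups=2)
```

In this toy case the first group has energy only on joint 0 and the second only on joint 1, which is the kind of group-specific pattern the averaged weight map is meant to expose.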
Conclusion
In this paper, we propose an end-to-end fully connected deep LSTM network for skeleton based action recognition. The proposed model facilitates the automatic learning of feature co-occurrences from the skeleton joints through our designed regularization. To ensure effective learning of the deep model, we design an in-depth dropout algorithm for the LSTM neurons, which performs dropout for the internal gates, cell, and output response of the LSTM neuron. Experimental results demonstrate the state-of-the-art performance of our model on several datasets.
Acknowledgment
We would like to thank David Wipf from Microsoft Research Asia for the valuable discussions, and thank Yong Du from Institute of Automation, Chinese Academy of Sciences for providing Hierarchical RNN code for the comparison.