An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data

Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu

Introduction

Recognition of human action is a fundamental yet challenging task in computer vision. It facilitates many applications such as intelligent video surveillance, human-computer interaction, and video summarization and understanding (?; ?). The key to success in this task is extracting discriminative spatio-temporal features that effectively model the spatial and temporal evolution of different actions.

One general approach focuses on recognition from RGB videos (?). Since each frame captures the highly articulated human body in a two-dimensional space, some information about the three-dimensional (3D) space is lost, which makes invariance to human location and scale harder to achieve. The other general approach leverages the high-level information of skeleton data, which represents a person by the 3D coordinate positions of key joints (e.g., head, neck, $\cdots$ , foot). Such a representation is robust to variations of location and viewpoint. Skeleton data alone, however, carries no appearance information. Fortunately, biological observations from the early seminal work of Johansson suggest that the positions of a small number of joints can effectively represent human behavior even without appearance information (?). Skeleton-based human representation has attracted increasing attention for recognizing human actions thanks to its high-level representation and robustness to variations of location and appearance (?). The prevalence of cost-effective depth cameras such as Microsoft Kinect (?) and advances in human pose estimation from depth (?) make 3D skeleton data easily accessible, which boosts research on skeleton-based human action recognition. In this work, we focus on recognition from skeleton data.

Fig. 1 shows an example of a series of skeleton frames (and RGB images) for the action “punching”. Each human body is represented by key joints in terms of coordinate positions in the 3D space. The articulated configurations of joints constitute various postures, and a series of postures in a certain time order identifies an action. With the skeleton as an explicit high-level representation of human pose, many works design algorithms that take the positions of joints as inputs. There are two basic components in these works. One is the design and mining of discriminative features from the skeleton, such as histograms of 3D joint locations (HOJ3D) (?), pairwise relative position features (?), and relative 3D geometry features (?). The other is the modeling of temporal dynamics, for example with Hidden Markov Models (?), Conditional Random Fields (?), and Recurrent Neural Networks (?). In this work, we present a spatio-temporal attention model that incorporates the two components into an end-to-end deep learning architecture.

For the joints of the skeleton, we propose a spatial attention module which automatically mines discriminative joints. A certain type of action is usually associated with and characterized by the combinations of only a subset of kinematic joints (?). As the action proceeds, the associated joints may also change accordingly. For example, the joints “hand”, “elbow”, and “head” are discriminative for the action “drinking” while the joints of the legs can be considered as noise. For the action “approaching and shaking hands”, the legs may receive attention at the beginning, while the arms attract more attention at the middle stage. In contrast to actionlet (?), the attention paid to joints is allowed to vary over time and is content-dependent.

Furthermore, for a sequence of frames, we propose a temporal attention module which explicitly learns content-dependent attention weights and allocates them to the output of each frame to boost recognition performance. Within a sequence, the flow of an action may pass through different stages, e.g., the preparation, the climax, and the end (Fig. 1). Taking the action “punching” as an example, the two persons approach each other, stretch out the hands, and kick out the legs. The frames that identify stretching out the hands and kicking out the legs belong to the key sub-stage. Different sub-stages have different degrees of importance and robustness to variations. In this paper, in contrast to the ideas of extracting key frames (?; ?), our proposed scheme assigns different attention weights to different frames instead of simply skipping frames.

In summary, we have made the following four main contributions in this work.

An end-to-end framework with two types of attention modules is designed based on LSTM networks for skeleton-based human action recognition.

A spatial attention module with joint-selection gates is designed to adaptively allocate different attention to different joints of the input skeleton within each frame. A temporal attention module with a frame-selection gate is designed to allocate different attention to different frames.

Spatio-temporal regularizations are proposed to enable better learning of the networks.

A joint training strategy is designed to efficiently train the entire end-to-end network.

Related Work

An action is usually associated with and characterized by the interactions and combinations of a subset of skeleton joints. An actionlet ensemble model is proposed to mine such discriminative joints (?), where an actionlet is a particular conjunction of the features for a subset of the joints and an action is represented as a linear combination of the actionlets. For example, for the action “drinking”, the subset of joints including “hand”, “elbow”, and “head” composes an actionlet. Orderlet extends actionlet by including the feature of pairwise joint distance and allowing subsets of various sizes (?). Actionlets or orderlets are mined from training samples for robust performance. In a recurrent neural network, a group sparsity constraint is introduced to the connection matrix to encourage the network to explore the co-occurrence of joints (?).

In the above methods, once the mining is done, the degrees of importance of joints/features are fixed and there will be no change for different temporal frames and sequences. In contrast, our spatial attention module determines the degrees of importance of joints on the fly based on the contents.

Temporal Key Frame Exploration

For identifying an action, not all frames in a sequence have the same importance. Some frames capture less meaningful information, or even carry misleading information associated with other types of actions, while other frames carry more discriminative information (?). A number of approaches use key frames as a representation for action recognition. One approach utilizes the conditional entropy of visual words to measure the discriminative power of a given frame, and combines the classification results from the top 25% most discriminative frames by majority vote (?). Another employs the AdaBoost algorithm to select the most discriminative key frames for human action recognition (?). The learning of key frames can also be cast in a max-margin discriminative framework by treating them as latent variables (?).

Leveraging key frames can help exclude noise frames, e.g., frames which are less relevant to the underlying actions. However, in comparison with holistic approaches (?; ?; ?), which use all the frames, key frame selection loses some information. In this paper, our temporal attention module determines the degree of importance of each frame. Instead of skipping frames, it allocates different attention weights to different frames to automatically exploit their respective discriminative power and focus more on the important frames.

Attention-Based Models

When observing the real world, a human usually focuses on some fixation points at first glance, i.e., pays different attention to different regions of the scene (?). Many applications leverage predicted saliency maps for performance enhancement (?; ?; ?); these methods explicitly learn the saliency maps under the guidance of human-labeled groundtruths.

The human labeled groundtruths for the explicit attention, however, are generally unavailable and might not be consistent with real attention related to the specific tasks. Recently, the exploitation of an attention model which implicitly learns attention has attracted increasing interest in various fields, such as machine translation (?), image caption generation (?), and image recognition (?). Selective focus on different spatial regions is proposed for action recognition on RGB videos (?). Ramanathan et al. propose an attention model which learns to detect events in RGB videos while attending to the people responsible for the event (?). The fusion of neighboring frames within a sliding window with learned attention weights is proposed to enhance the performance of dense labeling of actions in RGB videos (?). However, all the attention models mentioned above for action recognition are based on RGB videos. There is a lack of investigation of skeleton sequences, which exhibit different characteristics from RGB videos.

Overview of RNN and LSTM

In this section, we briefly review the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) to make the paper self-contained.

RNN is a popular model for sequential data modeling and feature extraction (?). Fig. 2(a) shows an RNN neuron. The output response ${h}_{t}$ at time step $t$ is determined by the input $\mathbf{x}_{t}$ and the hidden outputs of the RNN neurons at the previous time step, $\mathbf{h}_{t-1}$ :

$$h_{t}=\theta\left(\mathbf{w}_{xh}\mathbf{x}_{t}+\mathbf{w}_{hh}\mathbf{h}_{t-1}+b_{h}\right),$$

where $\theta$ represents a non-linear activation function, $\mathbf{w}_{xh}$ and $\mathbf{w}_{hh}$ denote the learnable connection vectors, and ${b}_{h}$ is the bias value. The recurrent structure and the internal memory of RNN facilitate its modeling of the long-term temporal dynamics of the sequential data.
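As a minimal sketch of this recurrence for a single neuron, the snippet below evaluates one time step in plain Python. The choice of $\tanh$ for $\theta$ and all numeric values are assumptions made only for illustration; the function name `rnn_step` is hypothetical.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """Output h_t of one RNN neuron, with theta chosen as tanh (an assumption)."""
    pre_activation = sum(w * x for w, x in zip(w_xh, x_t))
    pre_activation += sum(w * h for w, h in zip(w_hh, h_prev))
    return math.tanh(pre_activation + b_h)

# Toy values, chosen only for illustration.
x_t = [1.0, 0.0]        # input at time step t
h_prev = [0.0, 0.0]     # hidden outputs at time step t-1
h_t = rnn_step(x_t, h_prev, w_xh=[0.5, -0.3], w_hh=[0.2, 0.1], b_h=0.0)
```

Feeding the returned value back in as part of `h_prev` at the next step is what gives the neuron its internal memory.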

LSTM is an advanced RNN architecture which mitigates the vanishing gradient effect of RNN (?; ?; ?). As illustrated in Fig. 2(b), an LSTM neuron contains a memory cell ${c}_{t}$ which has a self-connected recurrent edge of weight 1. At each time step $t$ , the neuron can choose to write, reset, and read the memory cell governed by the input gate ${i}_{t}$ , forget gate ${f}_{t}$ and output gate ${o}_{t}$ .
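The write, reset, and read behavior of the gates can be illustrated with a scalar LSTM neuron. This is a sketch of the commonly used gate formulation without peephole connections, which is an assumption here since the exact variant is not specified above; the parameter names in `params` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM neuron (standard gates, no peepholes: an assumption)."""
    i_t = sigmoid(p["w_xi"] * x_t + p["w_hi"] * h_prev + p["b_i"])    # input gate: write
    f_t = sigmoid(p["w_xf"] * x_t + p["w_hf"] * h_prev + p["b_f"])    # forget gate: reset
    o_t = sigmoid(p["w_xo"] * x_t + p["w_ho"] * h_prev + p["b_o"])    # output gate: read
    u_t = math.tanh(p["w_xc"] * x_t + p["w_hc"] * h_prev + p["b_c"])  # candidate content
    c_t = f_t * c_prev + i_t * u_t   # memory cell carried over, scaled by the forget gate
    h_t = o_t * math.tanh(c_t)       # exposed output
    return h_t, c_t

# Toy parameters: all zero except biases that keep the memory and block writing.
params = {name: 0.0 for name in
          ("w_xi", "w_hi", "b_i", "w_xf", "w_hf", "b_f",
           "w_xo", "w_ho", "b_o", "w_xc", "w_hc", "b_c")}
params["b_f"] = 20.0     # forget gate close to 1: keep the memory
params["b_i"] = -20.0    # input gate close to 0: write nothing
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, p=params)
```

With these toy biases the memory cell passes through the step almost unchanged, which is the mechanism that lets gradients survive over many time steps.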

Deep LSTM with Spatio-Temporal Attention Model

We propose an end-to-end multi-layered LSTM network with spatial and temporal attention mechanisms for action recognition. The network is designed to automatically select dominant joints within each frame through the spatial attention module, and to assign different degrees of importance to different frames through the temporal attention module. Fig. 3 shows its overall architecture, which consists of a main LSTM network, a spatial attention subnetwork, and a temporal attention subnetwork. Because of the interplay among these three networks, training is challenging.

In the following, we discuss the proposed spatial attention module and temporal attention module respectively, which are both built based on the LSTM networks. We then introduce a regularized learning objective of our model and a joint training strategy, which help overcome the difficulty of model learning for the highly coupled network.

The action of persons can be described by the evolution of a series of human poses represented by the 3D coordinates of joints. In general, different actions involve different subsets of joints as discussed in Section 2.1.

We propose a spatial attention model to automatically explore and exploit the different degrees of importance of joints. With a soft attention mechanism, each joint within a frame is assigned a spatial attention weight based on the joint-selection gates. This enables our model to adaptively focus more on those discriminative joints.

In the spatial attention subnetwork, ${U}_{s}$ , ${W}_{xs}$ , and ${W}_{hs}$ are the learnable parameter matrices, and $\mathbf{b}_{s}$ , $\mathbf{b}_{us}$ are the bias vectors. $\mathbf{h}_{t-1}^{s}$ denotes the hidden variable from an LSTM layer as illustrated in Fig. 3. For the $k^{th}$ joint, the activation of the joint-selection gate, $\alpha_{t,k}$ , is obtained by normalizing the score of that joint over all joints in the frame.
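One form consistent with these definitions can be sketched as follows. It is a hedged reconstruction from the symbols in the text, not a verbatim formula: the $\tanh$ non-linearity, the score vector $\mathbf{s}_{t}$ with entries $s_{t,k}$, the joint count $K$, and the softmax form of the normalization are assumptions.

```latex
\mathbf{s}_{t} = U_{s}\tanh\!\left(W_{xs}\mathbf{x}_{t} + W_{hs}\mathbf{h}_{t-1}^{s} + \mathbf{b}_{s}\right) + \mathbf{b}_{us},
\qquad
\alpha_{t,k} = \frac{\exp(s_{t,k})}{\sum_{i=1}^{K}\exp(s_{t,i})}
```

Under this form the weights within one frame are positive and sum to one, so raising the weight of one joint necessarily lowers the others.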

Note that the proposed spatial attention model determines the importance of joints based on all the joints of the current time step and the hidden variables from an LSTM layer. On one hand, the hidden variables $\mathbf{h}_{t-1}^{s}$ contain information of past frames, benefiting from the capability of LSTM to explore long-range temporal dynamics. On the other hand, leveraging all joints within the current frame provides the necessary information for determining their importance. In this paper, the spatial attention subnetwork is composed of an LSTM layer, two fully connected layers, and a normalization unit, as illustrated in Fig. 3.

Bridged by the joint-selection gate, the main LSTM network and the spatial attention subnetwork can be jointly trained to implicitly learn the spatial attention model.
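The gating step can be sketched in a few lines of plain Python. The sketch assumes a softmax as the normalization unit and assumes that each joint's 3D coordinates are multiplied by its attention weight before entering the main LSTM network; the scores are passed in as given numbers rather than produced by the subnetwork, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to one."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_spatial_attention(joints, scores):
    """Scale each joint's 3D coordinates by its attention weight (assumed gating)."""
    alphas = softmax(scores)
    gated = [[a * c for c in joint] for a, joint in zip(alphas, joints)]
    return alphas, gated

# Toy frame with K = 3 joints; the scores are made-up values.
joints = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
alphas, gated = apply_spatial_attention(joints, scores=[2.0, 0.0, 0.0])
```

Because the gate is a differentiable scaling, the gradient of the classification loss reaches the attention subnetwork through the gated input, which is what allows the attention to be learned implicitly.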

Temporal Attention with Frame-Selection Gate

For a sequence, the amount of valuable information provided by different frames is in general not equal. Only some of the frames (key frames) contain the most discriminative information while the other frames provide context information. For example, for the action “shaking hands”, the sub-stage “approaching” should have lower importance than the sub-stage of “hands together”. Based on such insight, we design a temporal attention module to automatically pay different levels of attention $\beta$ to different frames.

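For the sequence-level classification, based on the output $\mathbf{z}_{t}$ of the main LSTM network and the temporal attention value $\beta_{t}$ at each time step $t$ , the scores for the $C$ classes are the weighted summation of the scores at all time steps, $\mathbf{o}=\sum_{t=1}^{T}\beta_{t}\cdot\mathbf{z}_{t}$ , where $\mathbf{o}$ collects the $C$ class scores and $T$ is the length of the sequence.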

As illustrated in Fig. 3, the attention module is composed of an LSTM layer, a fully connected layer, and a ReLU non-linear unit, connected in series. It plays the role of soft frame selection. The activation of the frame-selection gate, $\beta_{t}$ , is obtained by applying the ReLU to the output of the fully connected layer, and it depends on the current input $\mathbf{x}_{t}$ and the hidden variables $\mathbf{h}_{t-1}^{\thicksim}$ of time step $t-1$ from an LSTM layer. We use the ReLU non-linearity because of its good convergence behavior. The gate controls the amount of information of each frame to be used for making the final classification decision. The works (?; ?) are special cases of our model in which the attention weights on all frames are equal.

Bridged by the frame-selection gate, the main LSTM network and the temporal attention subnetwork can be jointly trained to implicitly learn the temporal attention model.
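The soft frame selection can be illustrated with a small plain-Python sketch. It takes the pre-activations of the gate as given numbers (in the network they would come from the fully connected layer of the temporal attention subnetwork), applies the ReLU to obtain $\beta_{t}$, and forms the weighted sum of the per-frame outputs $\mathbf{z}_{t}$. The function names and all values are hypothetical.

```python
def relu(a):
    return a if a > 0.0 else 0.0

def frame_gates(pre_activations):
    """Frame-selection gate values beta_t, one per time step."""
    return [relu(a) for a in pre_activations]

def sequence_scores(z, beta):
    """Class scores for the sequence: weighted sum over time of the per-frame scores."""
    num_classes = len(z[0])
    return [sum(b * z_t[c] for b, z_t in zip(beta, z)) for c in range(num_classes)]

# Toy sequence: T = 3 frames, C = 2 classes.
z = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
beta = frame_gates([-0.5, 2.0, 0.5])   # first frame is switched off by the ReLU
o = sequence_scores(z, beta)
```

Setting every entry of `beta` to the same constant reduces this to an unweighted sum over frames, which corresponds to the equal-weight special case mentioned above.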

Joint Spatial and Temporal Attention

The purpose of the attention models is to enable the network to pay different levels of attention to different joints and assign different degrees of importance to different frames as an action proceeds. We integrate spatial and temporal attention in the same network as illustrated in Fig. 3. How the spatial attention model acts on the input and how the temporal attention model acts on the output of the main LSTM network are illustrated in Fig. 4.

We formulate the final objective function of the spatio-temporal attention network as a regularized cross-entropy loss for a sequence, consisting of a cross-entropy item followed by three regularization items.
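One way to write such an objective, consistent with the descriptions of the individual items in this section, is sketched below. It is a reconstruction under stated assumptions rather than a verbatim formula: the label $y_{i}$, the predicted probability $\hat{y}_{i}$, the squared penalty in the second item, and the $1/T$ averaging in the third item are assumptions, and $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ are taken to weight the three regularization items in order.

```latex
L = -\sum_{i=1}^{C} y_{i}\log\hat{y}_{i}
    + \lambda_{1}\sum_{k=1}^{K}\Bigl(1-\frac{\sum_{t=1}^{T}\alpha_{t,k}}{T}\Bigr)^{2}
    + \frac{\lambda_{2}}{T}\sum_{t=1}^{T}\lVert\beta_{t}\rVert_{2}
    + \lambda_{3}\lVert W_{uv}\rVert_{1}
```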

The first regularization item is designed to encourage the spatial attention model to dynamically focus on more spatial joints in a sequence. We found that the spatial attention model is prone to consistently ignoring many joints over time even though these joints are also valuable for determining the type of action, i.e., it becomes trapped in a local optimum. We introduce this regularization item to avoid such solutions. For clarity, its target can be restated as $\sum_{t=1}^{T}\alpha_{t,k}\!\approx\!T$ , with $k=1,\cdots,K$ . This encourages paying equal attention to different joints.

The second regularization item keeps the learned temporal attention values under control with the $l_{2}$ norm rather than letting them increase unboundedly. This alleviates gradient vanishing in back-propagation, where the back-propagated gradient is proportional to $1/\beta_{t}$ .

The third regularization item with $l_{1}$ norm is to reduce overfitting of the networks. ${W}_{uv}$ denotes the connection matrix (merged to one matrix here) in the networks.
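To make the roles of the four items concrete, the sketch below evaluates a regularized loss of this kind on toy values. The exact functional form is an assumption (squared deviation of the time-averaged spatial attention from one, time-averaged magnitude of the temporal attention, and an $l_{1}$ penalty on a flattened weight list); the function name `sta_loss` and all inputs are hypothetical.

```python
import math

def sta_loss(y_true, y_prob, alphas, betas, weights, lam1, lam2, lam3):
    """Cross-entropy plus three regularization items (assumed functional form)."""
    T, K = len(alphas), len(alphas[0])
    # Item 1: cross-entropy over the C classes (terms with zero label are skipped).
    cross_entropy = -sum(y * math.log(p) for y, p in zip(y_true, y_prob) if y > 0)
    # Item 2: spatial regularization, one term per joint.
    spatial = sum((1.0 - sum(alphas[t][k] for t in range(T)) / T) ** 2
                  for k in range(K))
    # Item 3: temporal regularization; for a scalar beta_t the l2 norm is its magnitude.
    temporal = sum(abs(b) for b in betas) / T
    # Item 4: l1 penalty on the (flattened) connection weights.
    l1 = sum(abs(w) for w in weights)
    return cross_entropy + lam1 * spatial + lam2 * temporal + lam3 * l1

# Toy values: C = 2 classes, T = 2 frames, K = 2 joints.
loss = sta_loss(y_true=[0, 1], y_prob=[0.5, 0.5],
                alphas=[[0.5, 0.5], [0.5, 0.5]], betas=[1.0, 3.0],
                weights=[0.5, -1.5], lam1=0.001, lam2=0.0001, lam3=0.0005)
```

Dropping the second and third items (setting `lam1` and `lam2` to zero) leaves only the cross-entropy and the $l_{1}$ penalty.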

Joint Training of the Networks

Due to the mutual influence of the three networks, the optimization is rather difficult. We propose a joint training strategy to efficiently train the spatial and temporal attention modules, as well as the main LSTM network. The separate pre-training of the attention modules ensures the convergence of the networks. The training procedure is described in Algorithm 1.

Experimental Results

We perform our experiments on the following two datasets: the SBU Kinect interaction dataset (?), and the largest RGB+D dataset of NTU (Shahroudy et al. 2016).

SBU Kinect Interaction Dataset (SBU). The SBU dataset is an interaction dataset with two subjects. It contains 230 sequences of 8 classes (6614 frames) with subject independent 5-fold cross validation. Each person has 15 joints and the dimension of the input vector is $15\times 3\times 2=90$ . Note that we smooth each joint’s position of the skeleton in the temporal domain to reduce the influence of noise (?; ?).
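The input layout and the temporal smoothing can be sketched as follows. The smoothing filter is not specified in the text, so a centered moving average is used purely as a stand-in assumption; `flatten_frame` and `smooth` are hypothetical helper names.

```python
def flatten_frame(person_a, person_b):
    """Concatenate two skeletons (15 joints x 3 coordinates each) into one vector."""
    return [c for person in (person_a, person_b) for joint in person for c in joint]

def smooth(sequence, window=3):
    """Centered moving average over time, per coordinate (assumed filter).

    The window is truncated at the start and end of the sequence.
    """
    half = window // 2
    num_frames = len(sequence)
    smoothed = []
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        smoothed.append([sum(sequence[s][d] for s in range(lo, hi)) / (hi - lo)
                         for d in range(len(sequence[t]))])
    return smoothed

# Toy skeletons with all coordinates zero, only to check the dimension.
person = [[0.0, 0.0, 0.0] for _ in range(15)]
frame = flatten_frame(person, person)
dim = len(frame)
```

Smoothing each coordinate independently along time keeps the 90-dimensional layout unchanged, so the network input format is the same with or without this step.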

NTU RGB+D Dataset (NTU). The NTU dataset is currently the largest action recognition dataset with high-quality skeleton data (?). It contains 56880 sequences (with 4 million frames) of 60 classes, and is evaluated under Cross-Subject (CS) and Cross-View (CV) settings. Each person has 25 joints. We apply a similar normalization preprocessing step to achieve position and view invariance (?). To avoid destroying the continuity of a sequence, no temporal down-sampling is performed.

Implementation Details. For the network and parameter settings, we use three LSTM layers for the main LSTM network, and one LSTM layer for each attention network. Each LSTM layer is composed of 100 LSTM neurons. We experimentally set $\lambda_{1}$ , $\lambda_{2}$ , and $\lambda_{3}$ to $0.001$ , $0.0001$ , and $0.0005$ for the SBU dataset, and to $0.01$ , $0.001$ , and $0.00005$ for the NTU dataset. Adam (?) is adopted to automatically adjust the learning rate during optimization. The batch sizes for the SBU dataset and the NTU dataset are $8$ and $256$ , respectively. Dropout is utilized to mitigate overfitting (?).

Visualization of the Learned Attentions

We analyze what the learned spatial and temporal attention models attend to by visualizing the attention weights on test sequences.

Spatial Attention. For a sequence of the action “kicking”, Fig. 5(a) shows the amplitude of the spatial attention weights on joints by the sizes of the red circles. We also present concrete attention values in Fig. 6. The attention weights on the left foot, right elbow, and left hand of the right person are large. Meanwhile, the weights on the torso and right foot of the left person are large. Being content-dependent, the attention varies across frames. The joints learned to be important are consistent with human perception.

Temporal Attention. Fig. 5(b) shows the temporal attention weights $\beta$ . Fig. 5(c) shows the differentiated attention weights (i.e., $\vartriangle\!\!\beta_{t}=\beta_{t}-\beta_{t-1}$ ) for “Kicking”. Since the LSTM network usually accumulates more information as time goes on, the attention weight usually increases correspondingly. The increase of the attention weight, i.e., $\vartriangle\!\!\beta_{t}$ , can therefore indicate the importance of frame $t$ . We can see that the differentiated attention weight reaches its peak as the person on the right lifts his foot to the highest point, a moment which humans also consider more discriminative.
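The differentiated attention weights are a first-order difference over time, which the following sketch computes on made-up values and then uses to locate the frame with the largest increase. The values of `beta` are invented for illustration and are not results from the experiments.

```python
def differentiated_attention(beta):
    """Return beta_t - beta_{t-1} for t = 1, ..., T-1 (0-based frame indices)."""
    return [beta[t] - beta[t - 1] for t in range(1, len(beta))]

# Made-up attention weights that grow over time, with one sharp rise.
beta = [0.1, 0.2, 0.6, 0.7]
delta = differentiated_attention(beta)

# delta[i] belongs to frame t = i + 1, so shift the index of the maximum by one.
peak_frame = 1 + max(range(len(delta)), key=delta.__getitem__)
```

Looking at the difference rather than the raw weight removes the overall upward trend caused by the LSTM accumulating information, so the peak points at the frame that contributed the most new evidence.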

Effectiveness of the Proposed Attention Models

To validate the effectiveness of our designs, we conduct experiments with different configurations as follows.

LSTM: main LSTM network without attention designs.

SA-LSTM(w/o reg.): LSTM + spatial attention without regularization (only includes $1^{st}$ and $4^{th}$ items in (7)).

SA-LSTM: LSTM + spatial attention network.

TA-LSTM(w/o reg.): LSTM + temporal attention without regularization (only includes $1^{st}$ and $4^{th}$ items in (7)).

TA-LSTM: LSTM + temporal attention network.

STA-LSTM: LSTM + spatio-temporal attention network.

Fig. 7 shows the performance comparisons on the SBU, NTU (Cross-Subject), and NTU (Cross-View) datasets, respectively. Compared with the baseline scheme LSTM, the introduction of the spatial attention module (SA-LSTM) and of the temporal attention module (TA-LSTM) brings up to 5.1% and 6.4% accuracy improvement, respectively. The best performance is achieved by combining both modules (STA-LSTM).

In the objective function as defined in (7), the second and the third items for regularizations are designed for the spatial attention and temporal attention model, respectively. We can see they improve the performance of both spatial attention model and temporal attention model.

Comparisons to Other State-of-the-Art

We show performance comparisons of our final scheme with other state-of-the-art methods in Table 1 and Table 2 for the SBU and NTU datasets, respectively. Thanks to the introduction of the spatio-temporal attention models with efficient regularizations and the training strategy, our model is capable of extracting discriminative spatio-temporal features. Our scheme achieves about 10% accuracy gain on the NTU dataset for both the Cross-Subject and the Cross-View settings.

Conclusion

We present an end-to-end spatio-temporal attention model for human action recognition from skeleton data. To select discriminative joints automatically and adaptively, we propose a spatial attention module with joint-selection gates to assign different importance to each joint. To automatically exploit the different levels of importance of different frames, we propose a temporal attention module to allocate different attention weights to each frame of the whole sequence. Finally, we design a joint training procedure to efficiently combine spatial and temporal attention with a regularized cross-entropy loss. Experimental results demonstrate the effectiveness of our proposed model which achieves remarkable performance improvement in comparison with other state-of-the-art methods.