Part-based Graph Convolutional Network for Action Recognition

Kalpit Thakkar, P J Narayanan

Introduction

Recognizing human actions in videos is necessary for understanding them. Video modalities such as RGB, depth and skeleton provide different types of information for understanding human actions. The S-video (or skeletal modality) provides 3D joint locations, which is relatively high-level information compared to RGB or depth. With the release of several multi-modal datasets [Shahroudy et al.(2016)Shahroudy, Liu, Ng, and Wang, Chunhui et al.(2017)Chunhui, Yueyu, Yanghao, Sijie, and Jiaying, Chen et al.(2015)Chen, Jafari, and Kehtarnavaz], action recognition from S-video has gained significant traction recently [Liu et al.(2016)Liu, Shahroudy, Xu, and Wang, Song et al.(2017)Song, Lan, Xing, Zeng, and Liu, Liu et al.(2017)Liu, Wang, Hu, Duan, and Kot, Zhang et al.(2017b)Zhang, Liu, and Xiao, Ke et al.(2017)Ke, Bennamoun, An, Sohel, and Boussaid].

Graph convolutions [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov, Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst, Kipf and Welling(2016)] have been used to learn high-level features from arbitrary graph structures. State-of-the-art methods for action recognition from S-videos [Yan et al.(2018)Yan, Xiong, and Lin, Li et al.(2018b)Li, Cui, Zheng, Xu, and Yang] use graph convolutions, wherein the whole skeleton is treated as a single graph. It is, however, natural to think of the human skeleton as a combination of multiple body parts. A body-part based representation can learn the importance of each part and their relations across space and time. We present a model using a part-based graph convolutional network for recognizing actions from S-videos, using a novel part-based graph convolution scheme. The model attains better recognition performance than a model that treats the entire skeleton as a single graph. Current models for skeletal action recognition [Yan et al.(2018)Yan, Xiong, and Lin, Li et al.(2018b)Li, Cui, Zheng, Xu, and Yang] use 3D coordinates as features at each vertex. Geometric features such as relative joint coordinates and motion features such as temporal displacements can be more informative for action recognition. Optical flow helps in action recognition from RGB videos [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool] and a Manhattan line map helps in generating a 3D layout from a single image [Zou et al.(2018)Zou, Colburn, Shan, and Hoiem]. Geometric features [Zhang et al.(2017b)Zhang, Liu, and Xiao] and kinematic features [Zanfir et al.(2013)Zanfir, Leordeanu, and Sminchisescu] have been used for skeletal action recognition before. Inspired by these observations, we use a geometric feature that encodes relative joint coordinates and a motion feature that encodes temporal displacements at each vertex in our part-based graph convolution model, to significant effect.

The major contributions of this paper are: (i) Formulation of a general part-based graph convolutional network (PB-GCN) which can be learned for any graph with well-known properties and its application to recognize actions from S-videos, (ii) Use of geometric and motion features in place of 3D joint locations at each vertex to boost recognition performance, and (iii) Exceeding the state-of-the-art on challenging benchmark datasets NTURGB+D and HDM05. The overview of our representation and signals is shown in Figure 1.

Related Work

Skeletal action recognition has been approached using techniques such as handcrafted feature encodings, complex LSTM networks, image encodings with pretrained CNNs and non-Euclidean methods based on manifolds. Non-deep learning methods worked well initially and demonstrated the usefulness of several kinds of information extracted from S-videos, such as joint angles [Ofli et al.(2014)Ofli, Chaudhry, Kurillo, Vidal, and Bajcsy], distances [Xia et al.(2012)Xia, Chen, and Aggarwal] and kinematic features [Zanfir et al.(2013)Zanfir, Leordeanu, and Sminchisescu]. These methods learn from hand-designed features using shallow models, which do not model the spatio-temporal properties of actions very well and constrain learning capacity.

On the other hand, LSTM-based methods were used because S-videos can be thought of as time sequences of features. Spatio-temporal LSTMs [Liu et al.(2016)Liu, Shahroudy, Xu, and Wang, Liu et al.(2017)Liu, Wang, Hu, Duan, and Kot], attention-based LSTM [Song et al.(2017)Song, Lan, Xing, Zeng, and Liu] and simple LSTM networks with part-based skeleton representation [Tao and Vidal(2015), Du et al.(2015b)Du, Wang, and Wang] have been used. These methods either use complex LSTM models which have to be trained very carefully or use a part-based representation with a simple LSTM model. We propose a part-based graph convolutional network that has good learning capacity and uses a part-based representation, inheriting the good qualities of both types of aforementioned approaches. Image encodings of skeletons were proposed to facilitate the use of ImageNet-pretrained CNNs to extract spatio-temporal features. Ke et al. [Ke et al.(2017)Ke, Bennamoun, An, Sohel, and Boussaid] generate images using relative coordinates while Du et al. [Du et al.(2015a)Du, Fu, and Wang] and Li et al. [Li et al.(2018a)Li, He, Dai, Cheng, and Chen] proposed a body part-based image encoding. Due to inherent differences in information between such image encodings and RGB images, it is almost impossible to interpret the learned filters. In contrast, our method is intuitive as it uses a graph-based representation for the human skeleton.

Manifold learning techniques have been used for skeletal action recognition, where actions are represented as curves on Lie groups [Vemulapalli et al.(2014)Vemulapalli, Arrate, and Chellappa] and Riemannian manifold [Devanne et al.(2015)Devanne, Wannous, Berretti, Pala, Daoudi, and Del Bimbo]. Deep learning on these manifolds is difficult [Huang et al.(2017)Huang, Wan, Probst, and Van Gool] while deep learning on graphs (also a manifold) has developed recently [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst, Kipf and Welling(2016)]. Our method uses a human skeleton graph and learns a model using part-based graph convolutional network, exploiting the benefits of deep learning on graphs.

Graph-based methods

Representing S-videos as skeleton graph sequences for recognizing actions had not been explored until recently. Li and Leung [Li and Leung(2017)] construct graphs using a statistical variance measure dependent on joint distances and match them for recognition. Recently, Yan et al. [Yan et al.(2018)Yan, Xiong, and Lin] and Li et al. [Li et al.(2018b)Li, Cui, Zheng, Xu, and Yang] proposed a spatio-temporal graph convolutional network for action recognition from S-videos. Both methods construct graphs where the human skeleton is treated as a single graph. Our formulation explores a partitioned skeleton graph with a part-based graph convolutional network and we show that it improves recognition performance. Also, we use relative coordinates and temporal displacements as features at each vertex instead of 3D joint coordinates (see Figure 1(a)), which improves action recognition performance.

Background

Graph convolutions can be formulated using spectral graph theory [Defferrard et al.(2016)Defferrard, Bresson, and Vandergheynst] or spatial convolution [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov] on graphs. We focus on spatial convolutions in this paper as they resemble convolutions on regular grid graphs like RGB images [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov]. A graph CNN can then be formed by stacking multiple graph convolution units. Graph convolution (shown in Figure 2) can be defined as [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov]:
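Written out with the symbols defined in the next paragraph, and assuming a labeling function, denoted $\mathbf{L}$ here, that maps each neighbor $v_{j}$ to a filter index (the name $\mathbf{L}$ is an assumption chosen to match the labeling functions used later), the definition takes the form:

$$\mathbf{Y}(v_{i})=\sum_{v_{j}\in\mathcal{N}_{k}(v_{i})}\mathbf{W}(\mathbf{L}(v_{j}))\,\mathbf{X}(v_{j})$$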

where $v_{i}$ is the root vertex at which the convolution is centered (like the center pixel in an image convolution), $\mathbf{W}(\cdot)$ is a filter weight vector of size $\mathcal{L}$ indexed by the label assigned to neighbor $v_{j}$ in the $k$ -neighborhood $\mathcal{N}_{k}(v_{i})$ , $\mathbf{X}(v_{j})$ is the input feature at $v_{j}$ and $\mathbf{Y}(v_{i})$ is the convolved output feature at root vertex $v_{i}$ . Equation 2 can be written in terms of the adjacency matrix as:
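A plausible way to write the adjacency form, assuming $\mathbf{A}$ is the adjacency matrix of the graph (with self-connections, so that the root belongs to its own neighborhood), $\mathbf{D}$ is its diagonal degree matrix, and $\mathbf{L}(v_{j})$ denotes the label assigned to $v_{j}$ ; the symmetric normalization is one common choice and is an assumption here:

$$\mathbf{Y}(v_{i})=\sum_{j}\mathbf{A}^{\mathbf{norm}}(i,j)\,\mathbf{W}(\mathbf{L}(v_{j}))\,\mathbf{X}(v_{j}),\qquad \mathbf{A}^{\mathbf{norm}}=\mathbf{D}^{-\frac{1}{2}}\,\mathbf{A}\,\mathbf{D}^{-\frac{1}{2}}$$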

$\mathbf{A}^{\mathbf{norm}}(i,j)$ basically defines the neighbors at distance $1$ and hence, Equation 2 captures a more general form of convolution by using $k$ -order neighborhood $\mathcal{N}_{k}(v_{i})$ .
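To make the adjacency form concrete, the following minimal Python sketch applies it to a three-vertex path graph with scalar features. It is an illustration only: the function names are hypothetical, the inclusion of self-loops and the symmetric degree normalization are assumptions, and a single scalar weight stands in for the label-indexed filter.

```python
from math import sqrt

def normalized_adjacency(edges, num_vertices):
    """Return D^{-1/2} (A + I) D^{-1/2} as nested lists (self-loops assumed)."""
    a = [[0.0] * num_vertices for _ in range(num_vertices)]
    for i in range(num_vertices):
        a[i][i] = 1.0  # the root is part of its own 1-neighborhood
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / sqrt(deg[i] * deg[j]) for j in range(num_vertices)]
            for i in range(num_vertices)]

def graph_conv(a_norm, x, w):
    """Y(v_i) = sum_j A_norm(i, j) * w * X(v_j) with one shared scalar weight."""
    n = len(x)
    return [sum(a_norm[i][j] * w * x[j] for j in range(n)) for i in range(n)]

# Path graph 0 - 1 - 2 with one scalar feature per vertex.
a_norm = normalized_adjacency([(0, 1), (1, 2)], 3)
y = graph_conv(a_norm, x=[1.0, 2.0, 3.0], w=1.0)
```

Vertices 0 and 2 are not adjacent, so the corresponding entry of the normalized matrix is zero and neither contributes to the other's output, which is the sense in which the matrix encodes the distance-1 neighborhood.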

Graphs representing real-world manifolds can often be thought of as being made up of several parts. For instance, a graph representing a complex molecule consists of several simpler structures: a protein biomolecule can be divided into the polypeptide chains that make up the complex. Similarly, the human body can be visualized as connected rigid parts, much like a deformable part-based model [Felzenszwalb and Huttenlocher(2005)]. The graph of the skeleton of the human body can be divided into parts, where each subgraph represents a part of the human body.

In general, a part-based graph can be constructed as a combination of subgraphs where each subgraph has certain properties that define it. Let us consider that a graph $\mathcal{G}$ has been divided into $n$ partitions. Formally:
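One way to write this, assuming $\mathcal{V}_{p}$ and $\mathcal{E}_{p}$ denote the vertex set and edge set of partition $p$ :

$$\mathcal{G}=\bigcup_{p\in\{1,\ldots,n\}}\mathcal{P}_{p},\qquad \mathcal{P}_{p}=(\mathcal{V}_{p},\mathcal{E}_{p})$$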

$\mathcal{P}_{p}$ is the partition (or subgraph) $p$ of the graph $\mathcal{G}$ . We consider scenarios in which the partitions can share vertices or have edges connecting them. We proceed to explain how the part-based graph convolution is defined for the part-based graph.

Part-based Graph Convolutions

In essence, graph convolutions over parts are aimed at capturing high-level properties of parts and learning the relations between them. In a Deformable Part-based Model, different parts are identified and relations between them are learned through the deformation of the connections between them. Similarly, graph convolutions over a part identify the properties of that subgraph and an aggregation across subgraphs learns the relations between them. For a part-based graph, convolutions for each part are performed separately and the results are combined using an aggregation function $\mathcal{F}_{agg}$ . Using $\mathcal{F}_{agg}$ over edges across partitions:
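A sketch of this case, assuming $\mathbf{Y}_{p}(v_{i})$ denotes the output of the convolution restricted to part $p$ and $\mathbf{L}_{p}$ the labeling used inside that part (both notations are assumptions introduced here for illustration):

$$\mathbf{Y}_{p}(v_{i})=\sum_{v_{j}\in\mathcal{N}_{kp}(v_{i})}\mathbf{W}_{p}(\mathbf{L}_{p}(v_{j}))\,\mathbf{X}_{p}(v_{j}),\qquad p\in\{1,\ldots,n\}$$

$$\mathbf{Y}(v_{i})=\mathcal{F}_{agg}\big(\mathbf{Y}_{p1}(v_{i}),\,\mathbf{Y}_{p2}(v_{j})\big)\quad\text{for}\quad(v_{i},v_{j})\in\mathcal{E}_{(p1,p2)}$$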

Using $\mathcal{F}_{agg}$ for common vertices across partitions:
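A sketch of this case, assuming $\mathbf{Y}_{p}(v_{i})$ denotes the output at $v_{i}$ of the convolution restricted to part $p$ and $\mathcal{V}_{p}$ the vertex set of that part (notation assumed here for illustration):

$$\mathbf{Y}(v_{i})=\mathcal{F}_{agg}\big(\{\mathbf{Y}_{p}(v_{i})\;:\;v_{i}\in\mathcal{V}_{p},\;p\in\{1,\ldots,n\}\}\big)$$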

The convolution parameters $\mathbf{W}_{p}$ can be shared across parts or kept separate, while the neighbors of $v_{i}$ only in that part $(\mathcal{N}_{kp}(v_{i}))$ are considered. In order to combine the information across parts, the function $\mathcal{F}_{agg}$ combines information at shared vertices (equation 7) or shares information through edges crossing parts (equation 6, $\mathcal{E}_{(p1,p2)}$ contains all edges connecting parts p1 and p2), according to the partition configuration. A sophisticated $\mathcal{F}_{agg}$ can be employed to make the model powerful. Using graph convolutions, part-based graph models can learn rich representations and we demonstrate the strength of this model through application to action recognition from S-videos.
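The shared-vertex case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: features are scalars, each part uses a single shared weight, and $\mathcal{F}_{agg}$ is taken to be a weighted sum over the parts containing a vertex, with hypothetical weights `alphas`.

```python
def part_conv(part_vertices, part_edges, x, w):
    """Convolve one part: each vertex sums w * X over itself and its
    neighbors inside the part (1-neighborhood, single label)."""
    neighbors = {v: {v} for v in part_vertices}
    for i, j in part_edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return {v: sum(w * x[u] for u in neighbors[v]) for v in part_vertices}

def aggregate_shared(part_outputs, alphas):
    """F_agg as a weighted sum over the parts that contain each vertex
    (one possible choice; the weights are hypothetical)."""
    out = {}
    for y_p, alpha in zip(part_outputs, alphas):
        for v, value in y_p.items():
            out[v] = out.get(v, 0.0) + alpha * value
    return out

# Two parts of the path 0 - 1 - 2 - 3 that share vertex 1.
x = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
parts = [
    ((0, 1), [(0, 1)]),             # part 1: vertices, edges
    ((1, 2, 3), [(1, 2), (2, 3)]),  # part 2: vertices, edges
]
outputs = [part_conv(v, e, x, w=1.0) for v, e in parts]
y = aggregate_shared(outputs, alphas=[1.0, 1.0])
```

Only the shared vertex 1 receives contributions from both parts; every other vertex keeps the output of the single part it belongs to.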

Spatio-temporal Part-based Graph Convolutions

The S-videos are represented as spatio-temporal graphs. In order to include the temporal dimension, corresponding joints in each part are connected temporally. Figure 3(b) shows the spatio-temporal graph for the torso over five frames. Adapting select-assemble-normalize (patchy-san) proposed by Niepert et al. [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov], we present an overview of the convolution formulation for our spatio-temporal graph by extending ideas from section 3.2. For in-depth understanding, we refer the reader to [Niepert et al.(2016)Niepert, Ahmed, and Kutzkov]. We perform a spatial convolution on each partition following equation 5, combine the convolved partitions using $\mathcal{F}_{agg}$ and perform temporal convolution on the graph obtained by aggregating the partitions. In effect, we spatially convolve each partition independently for each frame, aggregate them at each frame and perform temporal convolution on the temporal dimension of the aggregated graph. For a possible partitioning of the human skeleton, this process is shown in Figure 3(c) for spatial convolution for a vertex common to torso and head, 3(d) for spatial convolutions in different frames, 3(e) for applying $\mathcal{F}_{agg}$ on head + torso and 3(f) for convolution on the temporal dimension of the combined graph.

We first define the spatial and temporal neighborhood of a vertex in spatio-temporal graph and assign labels to the vertices in the neighborhoods, which is required to perform convolutions. For each vertex, we use 1-neighborhood $(k=1)$ for spatial dimension $(\mathcal{N}_{1})$ as the skeleton graph is not very large and a $\tau$ -neighborhood $(k=\tau)$ for the temporal dimension $(\mathcal{N}_{\tau})$ . Figure 3(a) (dashed polygons) shows the spatial & temporal neighborhood for a root vertex. The different neighborhood sets for our model are defined as ( $\mathbf{d}(v_{i},v_{j})$ = length of shortest path between $v_{i}$ and $v_{j}$ ):

where $t_{a}$ and $t_{b}$ represent two time instants and $p\in\{1,\ldots,n\}$ is the partition index. The set of vertices $\mathcal{V}_{p}$ differs for each part, with some vertices shared between parts (Figure 1(c)). As temporal convolution is performed on the aggregated spatio-temporal graph, $\mathcal{N}_{\tau}$ is not part-specific. Figure 3(a) shows the spatial and temporal neighborhoods for a root vertex in torso. For ordering vertices in the receptive fields (or neighborhoods), we use a single label spatially $(\mathbf{L}_{S}:\mathcal{V}\rightarrow\{0\})$ to weigh vertices in $\mathcal{N}_{1p}$ of each vertex equally and $\tau$ labels temporally $(\mathbf{L}_{T}:\mathcal{V}\rightarrow\{0,\ldots,\tau-1\})$ to weigh vertices across frames in $\mathcal{N}_{\tau}$ differently. The labeling functions are defined as:
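The neighborhoods and labels described above can be sketched in Python as follows. This is a minimal sketch under assumptions: the function names are hypothetical, $\tau$ is assumed odd, and the temporal window is assumed to be centered on the root frame so that the offset $t_{b}-t_{a}$ maps onto the labels $\{0,\ldots,\tau-1\}$.

```python
def spatial_neighborhood(root, part_vertices, part_edges):
    """N_1p: vertices of the part at path length <= 1 from the root."""
    hood = {root}
    for i, j in part_edges:
        if i == root and j in part_vertices:
            hood.add(j)
        elif j == root and i in part_vertices:
            hood.add(i)
    return hood

def temporal_neighborhood(t_a, tau, num_frames):
    """N_tau: frames of the same joint within floor(tau / 2) steps of t_a
    (the centered window is an assumption of this sketch)."""
    half = tau // 2
    return [t for t in range(t_a - half, t_a + half + 1) if 0 <= t < num_frames]

def spatial_label(_vertex):
    """L_S: every spatial neighbor receives the single label 0."""
    return 0

def temporal_label(t_b, t_a, tau):
    """L_T: label in {0, ..., tau - 1} derived from the frame offset."""
    return (t_b - t_a) + tau // 2

# A root at frame 4 with tau = 3 sees frames 3, 4, 5 labeled 0, 1, 2.
frames = temporal_neighborhood(4, 3, num_frames=10)
labels = [temporal_label(t, 4, 3) for t in frames]
```

Because all spatial neighbors share one label, they share one filter weight, whereas each temporal offset gets its own weight.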

Using the labeled spatial and temporal receptive fields, we define the spatial and temporal convolutions as (adapted from [Kipf and Welling(2016)]):
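The order of operations (spatial convolution per part and per frame, aggregation at each frame, then temporal convolution on the aggregated graph) can be sketched in Python. This is an illustrative sketch only: features are scalars, the names are hypothetical, $\mathcal{F}_{agg}$ is assumed to be a plain sum over parts, and frames outside the video are simply skipped.

```python
def spatial_part_conv(frame, neighbors, w):
    """Per-frame spatial convolution inside one part (single label, k = 1)."""
    return {v: sum(w * frame[u] for u in hood) for v, hood in neighbors.items()}

def aggregate(part_outputs):
    """F_agg as a plain sum over the parts containing each vertex (assumed)."""
    out = {}
    for y_p in part_outputs:
        for v, value in y_p.items():
            out[v] = out.get(v, 0.0) + value
    return out

def temporal_conv(sequence, w_t):
    """Temporal convolution per vertex with tau = len(w_t) labeled weights."""
    half = len(w_t) // 2
    result = []
    for t_a in range(len(sequence)):
        frame_out = {}
        for v in sequence[t_a]:
            total = 0.0
            for label, w in enumerate(w_t):
                t_b = t_a + label - half  # frame that carries this label
                if 0 <= t_b < len(sequence):
                    total += w * sequence[t_b][v]
            frame_out[v] = total
        result.append(frame_out)
    return result

# Two parts sharing vertex 1; each neighbor set includes the root itself.
part_neighbors = [
    {0: {0, 1}, 1: {0, 1}},
    {1: {1, 2}, 2: {1, 2}},
]
video = [{0: 1.0, 1: 1.0, 2: 1.0} for _ in range(3)]  # 3 frames
aggregated = [
    aggregate([spatial_part_conv(frame, nb, w=1.0) for nb in part_neighbors])
    for frame in video
]
output = temporal_conv(aggregated, w_t=[0.5, 1.0, 0.5])
```

The temporal step never looks at parts: it operates on the aggregated per-frame graph, which matches the statement that $\mathcal{N}_{\tau}$ is not part-specific.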

The human skeleton can be divided into two major components: (1) the axial skeleton and (2) the appendicular skeleton. The body parts included in these two components are shown in Figure 1(b). The skeleton can be divided into parts based on these components. Different division schemes are shown in Figure 1(b), 1(c) and 1(d) and we use these schemes for experiments to test our PB-GCN.

For the final representation, we divide the human skeleton into four parts: head, hands, torso and legs, which corresponds to a division scheme where each of the axial and appendicular skeletons is divided into upper and lower components, as illustrated in Figure 1(c). We consider left and right parts of hands and legs together in order to be agnostic to laterality [Wikipedia(2015)] (handedness / footedness) of the human when performing an action. To show how being agnostic to laterality is helpful, we divide the upper and lower components of the appendicular skeleton into left and right (shown in Figure 1(d)), resulting in six parts, and show results on it. To cover all natural connections between joints in the skeleton graph, we include an overlap of at least one joint between two adjacent parts. For example, in Figure 1(c), shoulder joints are common between the head and hands. For the lower appendicular skeleton (viz. legs), we also include the joint at the base of the spine to get a good overlap with the lower axial skeleton.
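The overlap requirement can be checked mechanically. The Python sketch below uses a hypothetical joint assignment with illustrative joint names; it is not the exact partition of Figure 1(c), only an example that satisfies the two overlaps stated above (shoulders shared by head and hands, base of spine shared by torso and legs).

```python
# Hypothetical four-part assignment; joint names are illustrative only.
PARTS = {
    "head": {"head", "neck", "spine_shoulder", "shoulder_left", "shoulder_right"},
    "hands": {"shoulder_left", "elbow_left", "wrist_left", "hand_left",
              "shoulder_right", "elbow_right", "wrist_right", "hand_right"},
    "torso": {"spine_shoulder", "spine_mid", "spine_base"},
    "legs": {"spine_base", "hip_left", "knee_left", "ankle_left", "foot_left",
             "hip_right", "knee_right", "ankle_right", "foot_right"},
}

# Pairs of parts treated as adjacent in this sketch.
ADJACENT = [("head", "hands"), ("head", "torso"), ("torso", "legs")]

def shared_joints(part_a, part_b):
    """Joints that belong to both parts."""
    return PARTS[part_a] & PARTS[part_b]

def overlaps_ok():
    """True if every adjacent pair of parts shares at least one joint."""
    return all(len(shared_joints(a, b)) >= 1 for a, b in ADJACENT)
```

Keeping left and right limbs in the same set is what makes this assignment agnostic to laterality; a six-part variant would split the "hands" and "legs" sets by side.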

We represent each subgraph by its adjacency matrix, normalized by the corresponding degree matrix $\mathcal{D}$ . Our model takes as input a tensor having features for each vertex in the spatio-temporal graph of an S-video and outputs a vector of class scores for the video. The architecture of the graph convolutional network is similar to Yan et al. [Yan et al.(2018)Yan, Xiong, and Lin] and consists of $9$ spatio-temporal graph convolution units (each unit with the four $\mathbf{W}_{p}$ kernels, one $\mathbf{W}_{T}$ kernel and a residual) with an initial spatio-temporal head unit, based on a ResNet-like model [He et al.(2016)He, Zhang, Ren, and Sun]. The first three layers have 64 output channels, the next three have 128 and the last three have 256. We also use a learnable edge weight mask for learning edge weights in each subgraph [Yan et al.(2018)Yan, Xiong, and Lin]. We use the PyTorch framework [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer] for our implementation. The code and models are made publicly available: https://github.com/dracarys983/pb-gcn.

Geometric & Kinematic Signals

Yan et al. [Yan et al.(2018)Yan, Xiong, and Lin] use the 3D coordinates of each joint directly as the signal at each graph node. Relative coordinates [Zhang et al.(2017b)Zhang, Liu, and Xiao, Ke et al.(2017)Ke, Bennamoun, An, Sohel, and Boussaid] and temporal displacements [Zanfir et al.(2013)Zanfir, Leordeanu, and Sminchisescu] of joints have been used earlier for action recognition. Derived information like optical flow and Manhattan line maps has also been found useful on RGB images [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool, Zou et al.(2018)Zou, Colburn, Shan, and Hoiem]. Even a CNN framework can be more effective and efficient if relevant derived information is supplied as input to the network.

We use a signal at each node that combines temporal displacements across time and relative coordinates with respect to shoulders and hips [Ke et al.(2017)Ke, Bennamoun, An, Sohel, and Boussaid]. This signal provides translation invariance to the representation [Verma et al.(2018)Verma, Boyer, and Verbeek] and improves skeletal action recognition performance significantly. Figure 1(a) illustrates the computation of the two signals for a single skeleton video frame. Table 1(b) shows the effect of relative joint coordinates (geometric signal) and temporal displacements (kinematic signal) individually, as well as the performance improvement obtained by combining these signals, for a baseline one-part model and for our four-part model. The improvement in performance obtained using the geometric and kinematic signals is noteworthy.
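As an illustration of the two signals, the Python sketch below computes relative coordinates and temporal displacements for a toy two-joint skeleton and concatenates them per joint. It is a simplified sketch: a single reference joint stands in for the shoulder and hip references, and the function and joint names are hypothetical.

```python
def relative_coordinates(frame, reference):
    """Geometric signal: each joint's position relative to a reference joint."""
    rx, ry, rz = frame[reference]
    return {j: (x - rx, y - ry, z - rz) for j, (x, y, z) in frame.items()}

def temporal_displacements(prev_frame, frame):
    """Kinematic signal: per-joint displacement between consecutive frames."""
    return {j: tuple(c - p for c, p in zip(frame[j], prev_frame[j]))
            for j in frame}

def combined_signal(prev_frame, frame, reference):
    """Concatenate the geometric and kinematic signals for every joint."""
    rel = relative_coordinates(frame, reference)
    disp = temporal_displacements(prev_frame, frame)
    return {j: rel[j] + disp[j] for j in frame}  # tuple concatenation

def translate(frame, offset):
    """Shift every joint of a frame by the same 3D offset."""
    return {j: tuple(c + o for c, o in zip(p, offset)) for j, p in frame.items()}

frame0 = {"hip": (0.0, 0.0, 0.0), "hand": (1.0, 2.0, 3.0)}
frame1 = {"hip": (0.0, 1.0, 0.0), "hand": (2.0, 2.0, 3.0)}
signal = combined_signal(frame0, frame1, reference="hip")

# Moving the whole skeleton by a constant offset leaves the signal unchanged.
shift = (10.0, 10.0, 10.0)
shifted = combined_signal(translate(frame0, shift), translate(frame1, shift), "hip")
```

Both components are differences of joint positions, so a global translation of the skeleton cancels out, which is the translation invariance referred to above.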

Experimental Setup and Results

We use SGD as the optimizer and run the training for 80 epochs (NTURGB+D) / 120 epochs (HDM05). We set the initial learning rate to 0.1 and all the experiments are run on a cluster with $4$ Nvidia GTX 1080Ti GPUs. The batch size is set to 64. The learning rate decay schedule (decay by a factor of 0.1 at epochs 20, 50 and 70 for NTURGB+D, and at epoch 80 for HDM05) is finalized using a validation set. No augmentation is performed for any of the experiments, consistent with the graph-based method of [Yan et al.(2018)Yan, Xiong, and Lin]. We perform ablation studies on the large-scale NTURGB+D dataset (shown in Table 1) and then compare with the state-of-the-art on both HDM05 and NTURGB+D using the best configuration of our model (shown in Table 2).
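The decay schedule can be written as a small step function. The Python sketch below is an illustration under assumptions: epochs are counted from 0, the decay takes effect from the milestone epoch onward, and the function and constant names are hypothetical. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.1` provides a comparable schedule.

```python
def learning_rate(epoch, milestones, base_lr=0.1, gamma=0.1):
    """Step decay: multiply base_lr by gamma once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

NTU_MILESTONES = (20, 50, 70)  # 80 training epochs in total
HDM05_MILESTONES = (80,)       # 120 training epochs in total

# Learning rate at the start, after each decay, and in the last epoch.
ntu_schedule = [learning_rate(e, NTU_MILESTONES) for e in (0, 20, 50, 79)]
hdm05_schedule = [learning_rate(e, HDM05_MILESTONES) for e in (0, 79, 80, 119)]
```

Under these assumptions the NTURGB+D run ends at a learning rate of about 1e-4 and the HDM05 run at about 1e-2.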

NTURGB+D

[Shahroudy et al.(2016)Shahroudy, Liu, Ng, and Wang] This is currently the largest RGBD dataset for action recognition to the best of our knowledge. It has 56,880 video sequences shot with three Microsoft Kinect v2 cameras from different viewing angles. There are 60 classes among the action sequences and 3D coordinates of 25 joints are provided for each human skeleton tracked. There is a large variation in viewpoint, intra-class subjects and sequence lengths, which makes this dataset challenging. We remove 302 of the captured samples having missing or incomplete skeleton data. The protocol mentioned in Shahroudy et al. [Shahroudy et al.(2016)Shahroudy, Liu, Ng, and Wang] is followed for comparisons with previous methods.

HDM05

[Müller et al.(2007)Müller, Röder, Clausen, Eberhardt, Krüger, and Weber] This dataset was captured using an optical marker-based Vicon system. It contains 2337 action sequences ranging across 130 motion classes performed by five actors. This dataset currently has the largest number of motion classes. The actors are named “bd”, “bk”, “dg”, “mm” and “tr”, and 31 joints are annotated for each skeleton. This dataset is challenging due to intra-class variations induced by multiple realizations of the same action and the large number of motion classes. We follow the protocol given in [Huang and Van Gool(2017)] which is used by recent deep learning methods.

Discussion

Our motivation to use a part-based graph model is derived primarily from the fact that human actions are made up of “gestures” which represent the motion of a body part. The seminal success of DPMs [Felzenszwalb and Huttenlocher(2005)] in detecting humans in images reinforces the motivation further. We discuss the effect of the proposed spatio-temporal part-based graph model below.

(a) How many parts to have?

We start with a coarse-grained scheme where the entire skeleton is a single part and progress towards finer representations. The different partitions are: two parts, dividing the skeleton into axial and appendicular skeleton; four parts, as explained in section 4; and six parts, further splitting hands and legs into left and right. The feature at each vertex in the input is the 3D coordinate of the corresponding joint. From Table 1(a), we can see that using two parts improves over one and four improves over two. This shows that partitioning the skeleton graph into subgraphs with useful properties helps. However, further dividing the upper and lower appendicular skeleton of the four-part scheme into left and right does not improve performance, in line with our intuition about laterality mentioned in section 4. This experiment suggests that a part-based model improves performance over a single part and that being agnostic to laterality is helpful. Our final model uses the four-part division of the human skeleton.

(b) Comparison to graph-based models

From Table 2(a) and Table 1(b), it can be seen that our part-based model performs better than the graph-based model of Yan et al. [Yan et al.(2018)Yan, Xiong, and Lin] even when using $J_{loc}$ as the feature at each vertex. The graph construction in [Yan et al.(2018)Yan, Xiong, and Lin] uses a spatial partitioning scheme for their final model which divides the skeleton graph edge set into several partitions, while the vertex set has no partitions and contains all the joints. The difference in our model is that we divide the entire skeleton into smaller parts similar to human body parts and hence we use a different edge set and vertex set for each part. Compared to the graph-based model of Li et al. [Li et al.(2018b)Li, Cui, Zheng, Xu, and Yang], our model performs significantly better on NTURGB+D as well as HDM05. However, it is possible that this is because the number of layers in the network in [Li et al.(2018b)Li, Cui, Zheng, Xu, and Yang] is much smaller (2 vs 9) compared to our model. Our model outperforms both the previous graph-based models proposed for skeleton action recognition on the two datasets.

Geometric + Kinematic signals:

Providing a convolutional network with an explicit cue that is significant for the task at hand, such as optical flow for action recognition from RGB videos [Simonyan and Zisserman(2014)], helps it learn a richer representation by focusing on the cue. This motivates the use of geometric and kinematic features for skeletal action recognition. For the final configuration of our model, we concatenate the geometric and kinematic signals.

(a) Kinematic: temporal displacements

Temporal displacements provide information about the amount of motion happening between two frames. This information is analogous to the 3D scene flow of a very sparse set of points. We hypothesize that these displacements provide explicit motion information (like optical flow), which makes the model consider displacements as strong features and learn from them. The improvement in performance using this signal can be seen in Table 1(b), for both the four-part and the one-part model across both splits of NTURGB+D.

(b) Geometric: relative coordinates

These provide translation-invariant features as explained in [Verma et al.(2018)Verma, Boyer, and Verbeek] and they have been used effectively to encode skeletons into images by Ke et al. [Ke et al.(2017)Ke, Bennamoun, An, Sohel, and Boussaid]. Also, Zhang et al. [Zhang et al.(2017b)Zhang, Liu, and Xiao] used relative coordinates as a geometric feature which performs much better than 3D joint locations using a simple stacked LSTM network. We can see improvements in performance provided by relative coordinates in Table 1(b) for both global (one part) and four part-based models, which are the worst and best performing models according to Table 1(a).

Comparison to state of the art

NTURGB+D:

For this dataset, we outperform all previous state-of-the-art methods by a large margin. Even without using the signals introduced in section 5, we outperform the previous methods, which can be seen in Table 1(b) ( $J_{loc}$ results). We outperform the previous state-of-the-art graph-based method of Yan et al. [Yan et al.(2018)Yan, Xiong, and Lin] (STGCN), which is also the state-of-the-art for skeleton-based action recognition to the best of our knowledge, by a margin of ~6% and ~5% for the two protocols.

HDM05:

This dataset is ~20x smaller than NTURGB+D but contains more than twice as many classes. The sequences in this dataset are longer and some of the action classes have only one sequence [Cho and Chen(2014)]. The protocol of [Huang and Van Gool(2017)] is therefore very challenging; using it, we obtain state-of-the-art results with our model. We outperform the previous state-of-the-art Deep STGC [Li et al.(2018b)Li, Cui, Zheng, Xu, and Yang], which is a network based on spectral graph convolutions for skeleton action recognition, by ~3% in mean accuracy.

Conclusion

In this paper, we define a partition of the skeleton graph on which spatio-temporal convolutions are formalized through a part-based GCN for the task of action recognition. Such a part-based GCN learns the relations between parts and understands the importance of each part in human actions more effectively than a model that considers the entire body as a single graph. We also demonstrate the benefit of giving the convolutional model explicit cues that are significant for the task at hand, such as relative coordinates and temporal displacements for skeletal action recognition. As a result, our model achieves state-of-the-art performance on two challenging action recognition datasets. As future work, we would like to explore the use of the part-based graph model for tasks other than action recognition, such as object detection and measuring image similarity.