Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

Lin Sun, Kui Jia, Dit-Yan Yeung, Bertram E. Shi

Introduction

Human actions can be categorized by the visual appearance and motion dynamics of the involved humans and objects. The design of many popular human action recognition datasets is based on this intrinsic property. To recognize human actions in video sequences, computer vision researchers have been developing better visual features to characterize the spatial appearance and temporal motion. Since video sequences can naturally be viewed as three-dimensional (3D) spatio-temporal signals, many existing methods seek to develop different spatio-temporal features for representing spatially and temporally coupled action patterns. Thus far, although these methods are robust against some real-world human action conditions, when applied to more realistic, complex human actions, their performance often degrades significantly due to the large intra-category variations within action categories and inter-category ambiguities between action categories. A number of factors can cause large intra-category variations. Some major ones include large variations in visual appearance and motion dynamics of the constituent humans and objects, arbitrary illumination and imaging conditions, self-occlusion, and cluttered background. To address these challenges, some methods extract trajectories of interest points from video sequences to characterize the salient spatial regions and their motion dynamics. However, in general, the challenge of recognizing complex human actions has not been well addressed.

Most of the above methods use handcrafted features and relatively simple classifiers. More recently, the end-to-end approach of learning features directly from raw observations using deep architectures has shown great promise in many computer vision tasks, including object detection, semantic segmentation, and so forth. Using massive training datasets, these deep architectures are able to learn a hierarchy of semantically related convolution filters (or kernels), giving highly discriminative models and hence better classification accuracy. In fact, even directly applying these image-based models to individual frames of the videos has shown promising action recognition performance, because the learned features can better characterize the visual appearance in the spatial domain.

However, human actions in video sequences are 3D spatio-temporal signals. It is natural to expect that exploiting the temporal domain as well could further advance the state of the art, and some recent attempts have been made in this direction. The 3D CNN model learns convolution kernels in both space and time based on a straightforward extension of the established 2D CNN deep architectures to the 3D spatio-temporal domain. Other methods aim at learning long-range motion features through a hierarchy of multiple layers of 3D spatio-temporal convolution kernels, combined by early fusion, late fusion, or slow fusion. The two-stream CNN architecture learns motion patterns using an additional CNN which takes as input the optical flow computed from successive frames of video sequences. Because it relies on optical flow to capture motion features, the two-stream CNN is less effective for characterizing long-range or “slow” motion patterns, which may be more relevant to the semantic categorization of human actions. Possibly due to the increased complexity and difficulty of training 3D kernels without sufficient training video data (as compared to massive image datasets), 3D CNN does not perform well even on the less challenging KTH dataset. For the UCF-101 benchmark dataset, we note that previously reported results of 3D spatio-temporal convolution methods are inferior to those obtained by two-stream CNN. Indeed, spatio-temporal action patterns coupling the visual appearance and motion dynamics generally need an order of magnitude more training data than their 2D spatial counterparts. Moreover, existing methods often overlook the issue of sequence alignment, in which actions of different speeds and accelerations have to be handled properly for human action recognition.

The above analysis motivates us to consider alternative deep architectures which can handle 3D spatio-temporal signals more effectively. To this end, we propose a new deep architecture called factorized spatio-temporal convolutional networks ( $\textrm{F}_{\textrm{ST}}\textrm{CN}$ ). A schematic diagram of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ is shown in Figure 1. While details of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ will be presented in the next section, we summarize the key characteristics and main contributions of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ as follows.

$\textrm{F}_{\textrm{ST}}\textrm{CN}$ factorizes the original 3D spatio-temporal convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower network layers, called spatial convolutional layers (SCL), followed by learning 1D temporal kernels in the upper network layers, called temporal convolutional layers (TCL). This factorized scheme greatly reduces the number of network parameters to be learned, thus mitigating the compound difficulty of high kernel complexity and insufficient training video data.

We introduce a novel transformation and permutation (T-P) operator to form an intermediate layer of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ , as illustrated by the yellow box in Figure 1. The T-P operator facilitates learning of the temporal convolution kernels in the subsequent TCL.

To address the issue of sequence alignment, we propose a training and inference strategy based on sampling multiple video clips from a given action video sequence. Each video clip is produced by temporally sampling with a stride and spatially cropping from the same location of the given action video sequence. Using sampled video clips as inputs to $\textrm{F}_{\textrm{ST}}\textrm{CN}$ improves the robustness of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ against variations caused by sequence misalignment. This is similar in spirit to the data augmentation scheme commonly used for image classification.

In addition, we propose a novel score fusion scheme based on the sparsity concentration index (SCI). It puts more weights on the score vectors of class probability (output of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ ) that have higher degrees of sparsity. Experiments show that this score fusion scheme consistently improves over existing ones.

In summary, $\textrm{F}_{\textrm{ST}}\textrm{CN}$ is a cascaded deep architecture stacking multiple lower SCLs, a T-P operator, and an upper TCL. An additional SCL is also used in parallel with the TCL, aiming at learning a more abstract feature representation of spatial appearance. With the fully-connected (FC) and classifier layers on top, the whole $\textrm{F}_{\textrm{ST}}\textrm{CN}$ can be trained globally using back-propagation. Extensive experiments on benchmark human action recognition datasets show the efficacy of $\textrm{F}_{\textrm{ST}}\textrm{CN}$.

The proposed deep architecture

Applying a 3D convolution kernel $\mathbf{K}$ to a video signal by 3D convolution, denoted by $*$, yields the resulting spatio-temporal features $\mathbf{F}_{st}$. Ideally, a learned kernel $\mathbf{K}$ encodes some primitive spatio-temporal action patterns such that the entire set of action patterns can be reconstructed from sufficiently many such kernels. However, as discussed in Section 1, learning a representative set of 3D spatio-temporal convolution kernels is practically challenging due to the compound difficulty of the high complexity of 3D kernels and insufficient training videos. This is in contrast with the problem of learning 2D spatial kernels for image classification. To overcome this challenge, we resort to approximating the 3D kernels by characterizing the spatio-temporal action patterns using lower-complexity kernels. Not only does this approach need fewer videos for training, but it can also take advantage of existing massive image datasets to train the spatial kernels.

Although equivalence does not hold in general, we exploit the computational advantage by restricting to a family of 3D kernels $\mathbf{K}$ which can be expressed in a factorized form, that is, as the combination of a 2D spatial kernel and a 1D temporal kernel.

The above analysis motivates us to design a novel deep architecture which factorizes multiple layers of the original 3D spatio-temporal convolutions as a sequential process involving 2D spatial convolutions followed by 1D temporal convolutions. More specifically, our proposed $\textrm{F}_{\textrm{ST}}\textrm{CN}$ consists of several spatial convolutional layers (SCL). The basic components of an SCL are 2D spatial convolution kernels, nonlinearity (ReLU), local response normalization, and max-pooling, as illustrated in the black box of Figure 1. (Considering multiple feature channels or maps for each video frame, the convolution kernels in SCLs are in fact 3D kernels. To remain conceptually consistent with the 3D physical space, we choose to term the convolution kernels in SCLs as 2D kernels. The same reasoning applies to terming the convolution kernels in TCLs as 1D kernels.) Each convolutional layer must include convolution and ReLU, but normalization and max-pooling are optional. By processing individual frames of the video clips with the learned 2D spatial kernels, the SCLs are able to extract compact and discriminative features for the visual appearance.
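As a small numerical illustration of the factorization idea, the sketch below checks that, for a 3D kernel built as the outer product of a 2D spatial kernel and a 1D temporal kernel, a full 3D convolution gives the same result as a 2D spatial pass followed by a 1D temporal pass, and it compares the number of kernel weights in the two cases. It is a plain-Python sketch on single-channel toy data, not the network implementation; the function names (`conv3d_valid`, `outer_kernel`, `factorized_conv`), the `[t][y][x]` layout, and the use of valid-mode cross-correlation are assumptions made for this example.

```python
from itertools import product


def conv3d_valid(video, kernel):
    """Valid-mode 3D cross-correlation; both arguments are nested lists [t][y][x]."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_shape = (T - kt + 1, H - kh + 1, W - kw + 1)
    out = [[[0] * out_shape[2] for _ in range(out_shape[1])]
           for _ in range(out_shape[0])]
    for t, y, x in product(*(range(n) for n in out_shape)):
        out[t][y][x] = sum(
            video[t + a][y + b][x + c] * kernel[a][b][c]
            for a, b, c in product(range(kt), range(kh), range(kw))
        )
    return out


def outer_kernel(k_xy, k_t):
    """Build the separable 3D kernel K[a][b][c] = k_t[a] * k_xy[b][c]."""
    return [[[w * v for v in row] for row in k_xy] for w in k_t]


def factorized_conv(video, k_xy, k_t):
    """2D spatial pass on every frame, then 1D temporal pass at every pixel."""
    spatial = conv3d_valid(video, [k_xy])               # kernel shape 1 x kh x kw
    return conv3d_valid(spatial, [[[w]] for w in k_t])  # kernel shape kt x 1 x 1


# Toy integer data (hypothetical values) so that the comparison is exact.
video = [[[(7 * t + 3 * y + 5 * x) % 11 for x in range(4)]
          for y in range(4)] for t in range(5)]
k_xy = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
k_t = [1, -2, 1]

full = conv3d_valid(video, outer_kernel(k_xy, k_t))
fact = factorized_conv(video, k_xy, k_t)
print(full == fact)

# Weights to learn for one 3x3x3 kernel versus its factorized form.
params_3d = len(k_t) * len(k_xy) * len(k_xy[0])
params_factorized = len(k_xy) * len(k_xy[0]) + len(k_t)
print(params_3d, params_factorized)
```

The equality only holds for kernels of this separable form; a general 3D kernel cannot be written this way, which is why the factorized scheme is a restriction rather than an exact reformulation.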

To characterize the motion patterns, $\textrm{F}_{\textrm{ST}}\textrm{CN}$ further stacks a temporal convolutional layer (TCL) on top of the SCLs. The TCL has the same basic components as those of the SCLs. In order to learn the motion patterns that evolve over time, a layer called the T-P operator is inserted between the SCLs and TCL, as illustrated in the yellow box of Figure 1. Taking data in the form of 4D arrays (horizontal $x$, vertical $y$, temporal $t$, and feature channel $f$ as dimensions) as input, this T-P operator first vectorizes individual matrices of the 4D arrays along the horizontal and vertical dimensions such that each matrix of size $x\times y$ becomes a vector of length $x\times y$, and then rearranges the dimensions of the resulting 3D arrays (the transformation operation) so that 2D convolution kernels can be learned and applied along the temporal and feature-channel dimensions (i.e., the 1D temporal convolution in TCL). We term the convolution in TCL 1D temporal convolution to keep it conceptually consistent with the 3D physical space; it is in fact 2D convolution along the temporal and feature-channel dimensions. Note that the transformation operation is optional; we introduce it in the T-P operator conceptually to make temporal convolution in the subsequent TCL explicit, and practically to make the implementation of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ compatible with the popular deep learning libraries. Vectorization and transformation are followed by a permutation operation, via a permutation matrix $P$ of size $f\times f^{\prime}$, along the dimension of feature channels. It aims to reorganize the output feature channels so that convolution in the subsequent TCL is better supported by local 2D windows in the feature-channel and temporal directions.

As a tunable network parameter, $P$ is initialized from a Gaussian distribution and learned, in the same way as other network parameters, via back-propagation. Consequently, it reorganizes the feature channels by generating $f^{\prime}$ new feature maps via weighted combination of the $f$ input feature maps. Since TCL takes as input the output of the T-P operator, which in turn takes as input the intermediate feature maps of all frames of the input video clip (i.e., output of the SCLs), a pixel location in the vectorized spatial domain of TCL corresponds to a larger receptive field of the input video clip. In other words, TCL’s 1D temporal convolution kernels essentially capture the motion patterns constituted by the visual appearance of local regions of the input video clip as it evolves over time. When combined with our proposed video clip sampling strategy (to be presented in Section 2.2), they capture long-range motion patterns of relatively holistic visual appearance at a cheaper learning cost.

Details of the TCL are presented in the purple box of Figure 1. Two parallel convolutions with different kernel sizes are applied in the TCL, and their outputs are concatenated to represent the temporal characteristics. Dropout follows each ReLU to reduce overfitting. Two advantages can be observed. First, as stated before, actions can be performed at different speeds or with varying accelerations; the “slow” ones (long time duration) can be captured using the large kernel, while the “fast” ones (short time duration) can be captured using the small kernel. Second, the parallel convolutional layers provide richer motion properties, which we expect to benefit the action recognition task.
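The data rearrangement performed by the T-P operator can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the input layout `features[t][f][y][x]`, the output layout `out[s][t][g]` (with `s` the flattened spatial index), and the function name `tp_operator` are hypothetical, and the matrix `P` is a fixed toy matrix here, whereas in the text it is initialized from a Gaussian distribution and learned by back-propagation.

```python
def tp_operator(features, P):
    """Vectorize the spatial dimensions, then mix feature channels with P.

    features: nested list indexed [t][f][y][x] (output of the SCLs for one clip).
    P:        nested list of shape f x f_new (channel mixing matrix).
    Returns a nested list indexed [s][t][g], where s = y * W + x, so that each
    spatial location carries a (time x channel) matrix for the TCL to convolve.
    """
    T, F = len(features), len(features[0])
    H, W = len(features[0][0]), len(features[0][0][0])
    F_new = len(P[0])
    out = []
    for y in range(H):
        for x in range(W):
            plane = []
            for t in range(T):
                vec = [features[t][f][y][x] for f in range(F)]
                plane.append([sum(vec[f] * P[f][g] for f in range(F))
                              for g in range(F_new)])
            out.append(plane)
    return out


# Toy input (hypothetical values): T=2 frames, F=2 channels, 1x2 spatial maps.
features = [[[[1, 2]], [[3, 4]]],
            [[[5, 6]], [[7, 8]]]]
identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
mix = [[1, 0, 1], [0, 1, 1]]   # f=2 input channels, f_new=3 output channels

out_identity = tp_operator(features, identity)
out_swap = tp_operator(features, swap)
out_mix = tp_operator(features, mix)
print(out_identity)
```

With the identity matrix the operator only reshapes the data; with a general `P`, each of the new channels is a weighted combination of the input channels, which is the reorganization described above.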

In $\textrm{F}_{\textrm{ST}}\textrm{CN}$, an additional SCL is also used in parallel with TCL, aiming at learning a more abstract feature representation of visual appearance. Similar to a convolutional layer in conventional CNNs, this SCL improves the spatial invariance of action categories by extracting salient/discriminative appearance features via the learned spatial filters and the subsequent nonlinearity and pooling operations. Two fully-connected layers are stacked on top of each of the parallel TCL and SCL, and their outputs are then concatenated as the spatio-temporal features. Finally, an FC layer and a softmax classification layer are further cascaded for supervised training by standard back-propagation. The whole architecture of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ with specific layer components is presented in Figure 1.

2.2 Data augmentation by sampling video clips

Human actions are visual signals contained in video sequences. Some of them are also periodic with multiple action cycles repeating over time. It is usually a prerequisite step to detect and align action instances from the containing video sequences, in order to compare action instances performed at different speeds or with varying accelerations. This issue of action sequence detection/alignment is traditionally addressed by sliding windows across the temporal direction, dynamic time warping, or detecting trajectories of interest points from video sequences. However, it is generally overlooked in the more recent deep learning based action recognition methods.

Instead of deliberately detecting and aligning action instances from video sequences, we propose in this paper a training and inference strategy based on sampling multiple video clips from a given video sequence. Note that our proposed scheme is different from the bag of video words, which extracts spatio-temporal features from the video cuboid. Each video clip in our scheme is produced by temporally sampling with a stride and spatially cropping from the same location of the given video sequence, as illustrated in Figure 2. Such sampled video clips are not guaranteed to be aligned with action cycles. However, the motion pattern is well preserved in the sampled video clips if they cover a long enough time duration. We use them to train $\textrm{F}_{\textrm{ST}}\textrm{CN}$ in a supervised manner, and expect that representations robust to the misalignment can be learned at the upper network layers. Moreover, even if some misalignment exists, the discriminative motion patterns can still be preserved, since our TCL learns its kernels along the feature and temporal dimensions.

We sample multiple pairs of video clips $\{\mathbf{V}_{clip},\mathbf{V}_{clip}^{diff}\}$, and use the sampled video clip pairs as inputs of $\textrm{F}_{\textrm{ST}}\textrm{CN}$. Our sampling strategy is similar in spirit to the data augmentation commonly used in image classification, where results have demonstrated that such a strategy is able to reduce overfitting of network parameters by greatly increasing the number of training samples, and also to improve robustness of the learned networks against misalignment of object instances in images. Our proposed video clip sampling strategy extends data augmentation to the temporal domain, aiming to address the issue of sequence alignment in the problem of video based human action recognition.
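The sampling of one clip pair can be sketched as below. This is a minimal illustration, not the authors' procedure: the function names, the parameters `clip_len`, `stride`, `offset`, `start`, `top`, `left`, and `size` are hypothetical, and in particular the construction of the difference clip as the absolute difference between cropped frames `offset` frames apart is an assumption made for the example, since the exact definition of $\mathbf{V}_{clip}^{diff}$ is not restated here.

```python
def sample_clip_indices(num_frames, clip_len, stride, start=0):
    """Frame indices of a clip sampled with a temporal stride."""
    indices = [start + i * stride for i in range(clip_len)]
    if start < 0 or indices[-1] >= num_frames:
        raise ValueError("clip does not fit inside the video")
    return indices


def crop(frame, top, left, size):
    """Square spatial crop of one frame given as a nested list [y][x]."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sample_clip_pair(video, clip_len, stride, offset, start, top, left, size):
    """Return (v_clip, v_diff), both cropped at the same spatial location.

    ASSUMPTION: v_diff[i] = |frame[idx] - frame[idx + offset]| per pixel.
    """
    # Reserve `offset` frames at the end so that idx + offset stays valid.
    indices = sample_clip_indices(len(video) - offset, clip_len, stride, start)
    v_clip, v_diff = [], []
    for idx in indices:
        a = crop(video[idx], top, left, size)
        b = crop(video[idx + offset], top, left, size)
        v_clip.append(a)
        v_diff.append([[abs(p - q) for p, q in zip(ra, rb)]
                       for ra, rb in zip(a, b)])
    return v_clip, v_diff


# Toy video (hypothetical): 30 frames of 4x4 pixels, frame t filled with value t.
video = [[[t] * 4 for _ in range(4)] for t in range(30)]
v_clip, v_diff = sample_clip_pair(video, clip_len=3, stride=5, offset=2,
                                  start=1, top=1, left=1, size=2)
print([frame[0][0] for frame in v_clip], [frame[0][0] for frame in v_diff])
```

Drawing several values of `start`, `top`, and `left` per video would yield multiple clip pairs from one sequence, which is the sense in which the strategy acts as data augmentation in the temporal domain.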

Considering that $\mathbf{V}_{clip}^{diff}$ contains short-range and long-range motion information, and $\mathbf{V}_{clip}$ (mostly) contains information of visual appearance, our use of the sampled video clip pairs in $\textrm{F}_{\textrm{ST}}\textrm{CN}$ is as follows. We first feed individual frames of each pair of $\mathbf{V}_{clip}$ and $\mathbf{V}_{clip}^{diff}$ into the lower SCLs. After mid-level spatial feature representations of these individual frames are extracted, we separate these mid-level features by feeding those from all frames of $\mathbf{V}_{clip}^{diff}$ into TCL (after the T-P operator), and feeding those from a randomly sampled frame of $\mathbf{V}_{clip}$ into the intermediate SCL that is parallel to the TCL. At test time, the middle frame of $\mathbf{V}_{clip}$ is fed into this SCL instead. This separation of signal pipelines is consistent with our architecture design of $\textrm{F}_{\textrm{ST}}\textrm{CN}$.

2.3 Learning and inference

$\textrm{F}_{\textrm{ST}}\textrm{CN}$ has factorized SCLs and TCLs. To learn spatial and temporal convolution kernels effectively, we follow ideas from prior work and introduce auxiliary classifier layers connected to the lower SCLs, as illustrated in the dashed red box of Figure 1. In practice, we first use ImageNet to pre-train this auxiliary network, and then use randomly sampled training video frames to fine-tune it, in order to get better 2D spatial convolution kernels in the lower SCLs. Following prior advice, we fine-tune only the last three layers of the auxiliary network. Finally, the whole $\textrm{F}_{\textrm{ST}}\textrm{CN}$ network is globally trained via back-propagation, using sampled pairs of video clips as inputs. Note that in our training of TCL in $\textrm{F}_{\textrm{ST}}\textrm{CN}$, we do not use video data additional to the currently working action dataset; in contrast, additional training videos from a second action dataset are used for training in two-stream CNNs.

In the inference stage, given a test action sequence, we first sample pairs of video clips as explained in Section 2.2. We then pass each of the sampled video clip pairs through the $\textrm{F}_{\textrm{ST}}\textrm{CN}$ pipeline, resulting in a score output of class probability. These scores are fused to get the final recognition result, for which we propose a novel score fusion scheme that will be introduced shortly.

2.4 SCI based score fusion

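Here $p_{j}$ denotes the $j^{th}$ entry of a score vector $\mathbf{p}$, and $\textrm{SCI}(\mathbf{p})\in[0,1]$, with larger values indicating a sparser score vector. Given $C$ cropped videos of the $i^{th}$ sampled video clip pair and their score outputs $\{\mathbf{p}_{k,i}\}_{k=1}^{C}$, our proposed SCI based score fusion scheme computes the final score of class probability for this clip pair as an SCI-weighted combination of these score vectors, so that sparser score vectors receive larger weights.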

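The $M$ pairs of video clips are then fused by taking, for each action category, the maximum score over the clip pairs, which gives the final score vector $\hat{\mathbf{p}}$.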

The test video sequence is finally recognized as the action category that has the largest entry value of $\hat{\mathbf{p}}$, i.e., $\arg\max_{j=1,\dots,N}\hat{p}_{j}$ with $\hat{p}_{j}$ as the $j^{th}$ entry of $\hat{\mathbf{p}}$. Our score fusion scheme also helps compensate for the misalignment problem, since the maximum values over the video clips are taken.
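A minimal sketch of this kind of fusion is given below. It rests on several assumptions that should be checked against the original equations: the SCI is taken in the form used in the sparse-representation literature, $(N\cdot\max_{j}p_{j}/\sum_{j}p_{j}-1)/(N-1)$; the per-clip fusion is assumed to be a weighted average normalized by the sum of the SCI weights; the fallback to a plain average when all weights are zero is an addition for numerical safety; and all function names are hypothetical.

```python
def sci(p):
    """Sparsity concentration index of a score vector with positive sum.

    ASSUMED form: (N * max(p) / sum(p) - 1) / (N - 1), which is 0 for a
    uniform vector and 1 for a one-hot vector. Requires N >= 2.
    """
    n = len(p)
    return (n * max(p) / sum(p) - 1) / (n - 1)


def fuse_crops(crop_scores):
    """SCI-weighted average of the score vectors of the crops of one clip pair."""
    n = len(crop_scores[0])
    weights = [sci(p) for p in crop_scores]
    total = sum(weights)
    if total == 0:  # every vector is uniform: fall back to a plain average
        weights, total = [1.0] * len(crop_scores), float(len(crop_scores))
    return [sum(w * p[j] for w, p in zip(weights, crop_scores)) / total
            for j in range(n)]


def fuse_clips(clip_scores):
    """Entry-wise maximum over the fused score vectors of all clip pairs."""
    n = len(clip_scores[0])
    return [max(p[j] for p in clip_scores) for j in range(n)]


def predict(all_scores):
    """all_scores[i][k] is the score vector of crop k of clip pair i."""
    fused = fuse_clips([fuse_crops(crops) for crops in all_scores])
    return max(range(len(fused)), key=fused.__getitem__)


# Toy scores (hypothetical): 2 clip pairs, 4 action categories.
scores = [
    [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]],  # clip pair 1, two crops
    [[0.1, 0.8, 0.05, 0.05]],                          # clip pair 2, one crop
]
print(predict(scores))
```

In this toy case the uniform score vector of the second crop receives zero weight, so it does not dilute the confident prediction of the first crop, which is the behaviour the weighting is meant to encourage.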

We illustrate in Figure 3 the idea of our proposed SCI based score fusion scheme. Experiments in Section 3 show that it consistently improves over the commonly used averaging scheme. We expect our proposed scheme is also useful in other deep learning based classification methods.

2.5 Implementation details

The details of the first four SCLs, which extract the compact and discriminative feature representations of visual appearance, are: $Conv(96,7,2)-ReLU-Norm-Pooling(3,2)-Conv(256,5,2)-ReLU-Norm-Pooling(3,2)-Conv(512,3,1)-Conv(512,3,1)$, where $Conv(c_{f},c_{k},c_{s})$ denotes a convolutional layer with $c_{f}$ feature maps and a kernel size of $c_{k}\times c_{k}$, applied to the input with stride $c_{s}$ in the width and height directions. $Norm$ and $Pooling(p_{k},p_{s})$ are the local response normalization layer and the pooling layer, where $p_{k}$ is the spatial window and $p_{s}$ is the pooling stride. Rectified activation functions (ReLU) are applied to all hidden weight layers; max-pooling is performed over $3\times 3$ spatial windows with stride 2; local response normalization across channels follows settings from prior work, that is: $k=2,n=5,\alpha=5\times 10^{-4},\beta=0.75$. The SCL which connects to the TCL contains convolution ($Conv(128,3,1)$) and pooling ($Pooling(3,3)$). The permutation matrix $P$ has a size of $128\times 128$, that is, $f=f^{\prime}=128$. The TCL has two parallel convolutional layers ($Conv(32,3,1)$ and $Conv(32,5,1)$), each of which has a dropout layer with a dropout probability of 0.5. Note that the TCL is not followed by a pooling layer, since pooling would ruin the temporal cues. Two fully-connected (FC) layers with 4096 and 2048 hidden nodes, respectively, are stacked on top of each of the TCL and SCL. The feature outputs from the fully-connected layers of SCL and TCL are then concatenated and passed through another FC layer with 2048 hidden nodes. For training, we use a batch size of 32, momentum of 0.9, and weight decay of 0.0005. Instead of using the popular input size of $224\times 224$, we use the size of $204\times 204$ to save memory. Note that the settings of the spatial convolutional path follow prior work, except that the input size is smaller.
At each training iteration, the frames in each pair of video clips are randomly cropped at the same location and flipped simultaneously.

Experiments

Experiments are conducted on two benchmark action recognition datasets, namely, UCF-101 and HMDB-51, which are the largest and most challenging annotated action recognition datasets.

UCF-101 is composed of realistic web videos, which are typically captured with camera motions and under various illuminations, and contain partial occlusion. It has 101 categories of human actions, ranging from daily life to unusual sports (such as “Yo Yo”). UCF-101 has more than 13K videos with an average length of 180 frames per video. It has 3 split settings to separate the dataset into training and testing videos. We report the mean classification accuracy over these three splits.

HMDB-51 has a total of 6766 videos organized as 51 distinct action categories, which are collected from a wide range of sources. This dataset is more challenging than others because it has more complex backgrounds and context environments. What is more, there are many similar scenes in different categories. Since the number of training videos is small in this dataset, it is more challenging to learn representative features. Similar to UCF-101, HMDB-51 also has 3 split settings to separate the dataset into training and testing videos, and we report the mean classification accuracy over these three splits.

In the experiments, $5$ pairs of video clips $\{\mathbf{V}_{clip},\mathbf{V}_{clip}^{diff}\}$ are temporally sampled, with $d_{t}=9$ and $s_{t}=5$. These sampled video clips are representative enough to convey the long-range motion dynamics. First, we use the TCL path of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ (orange arrows in Figure 1) and split 1 of HMDB-51 to investigate whether using two different convolution kernels in TCL is better than using one kernel only. The sizes of the two kernels are $3\times 3$ and $5\times 5$, respectively. Results of this investigation are reported in Table 1, which shows that using two different kernels is better than using either one of them, and that using a larger kernel is better than using a smaller one. Note that these results are obtained without using our score fusion scheme. In Table 1, we also compare with the two-stream CNN in the setting of using the temporal convolution pipeline only, i.e., using the TCL path with $\mathbf{V}_{clip}^{diff}$ as the input for $\textrm{F}_{\textrm{ST}}\textrm{CN}$ and the optical flow CNN stream for the two-stream CNN. Our result ($48.4\%$) for the TCL path is better than that ($46.6\%$) of the optical flow CNN stream when only split 1 of HMDB-51 is used as training videos. We note that auxiliary training videos are also used by the two-stream CNN to boost performance, as shown in Table 1. Using auxiliary training videos is complementary to our proposed technique of factorized SCL and TCL. We expect that our result can be further improved given auxiliary training videos.

We present controlled experiments on the UCF-101 and HMDB-51 datasets in Table 2, where results from the different proposed contributions are specified. Table 2 shows that our proposed data augmentation scheme by sampling video clips and the SCI based score fusion scheme both effectively improve the recognition performance. In particular, when features from the SCL path and the TCL path are concatenated and trained globally via back-propagation, a gain of about $10\%$ is obtained, indicating that the learned spatial and temporal features are complementary to each other. Results from our main contribution of $\textrm{F}_{\textrm{ST}}\textrm{CN}$ are presented next, in comparison with the state of the art.

Table 3 compares $\textrm{F}_{\textrm{ST}}\textrm{CN}$ with other state-of-the-art methods, where performance is measured by mean accuracy on three splits of the HMDB-51 and UCF-101 datasets. Compared with the state-of-the-art CNN based method, our method performs better by about $1\%$ on both datasets when averaging fusion is adopted. When that method uses a supervised learning based SVM score fusion scheme, our method still achieves better or comparable performance on the two datasets. We note that some of the compared results are obtained by using auxiliary training videos, while our results are obtained by using each of the working datasets only. We expect that our results can be further boosted given auxiliary training videos.

Visualization

To visually verify the relevance of learned parameters in $\textrm{F}_{\textrm{ST}}\textrm{CN}$ , we use back-propagation to visualize important regions for any action category, i.e., back-propagating the neuron of that action category in the classifier layer to the input image domain. Figure 5 gives illustration for several action categories. The shown “saliency” maps suggest that learned parameters in $\textrm{F}_{\textrm{ST}}\textrm{CN}$ can capture the most representative regions of action categories. For example, the saliency map of the action “smile” displays a “monster” face, suggesting that “smile” happens around the mouth.

To investigate whether our learned spatio-temporal features are discriminative for action recognition, we plot in Figure 4 the learned features of several action categories (“smile”, “laugh”, “chew”, “talk”, “eat”, “smoke”, and “drink” in the HMDB-51 dataset). These features are visualized using the dimensionality reduction method t-SNE. Since these action categories are mainly concerned with face motions, especially mouth movements, they cannot be easily distinguished. Figure 4 shows that the spatio-temporal features extracted from the FC layer after the SCL and TCL outputs are concatenated are more discriminative than either the spatial features extracted from the second FC layer of SCL or the temporal features extracted from the second FC layer of TCL.

Conclusion

In this paper, a novel deep learning architecture, termed $\textrm{F}_{\textrm{ST}}\textrm{CN}$, is proposed for action recognition. $\textrm{F}_{\textrm{ST}}\textrm{CN}$ is a cascaded deep architecture which learns effective spatio-temporal features through training using standard back-propagation. Its factorization design mitigates the compound difficulty of high kernel complexity and insufficient training videos. The T-P operator provides a novel feature and temporal representation for actions. Moreover, the two parallel kernels in the TCL help it learn more representative temporal features. In addition, the additional SCL extracts a more abstract spatial appearance representation, which largely compensates for the deficiency of the TCL, as shown in the experimental results. Extensive experiments on the action benchmark datasets demonstrate the advantage of our algorithm even without additional training videos.

Acknowledgment

This research has been partially supported by Faculty Research Award Z0400-D granted to Dit-Yan Yeung and the National Natural Science Foundation of China (Grant No. 61202158).