History Repeats Itself: Human Motion Prediction via Motion Attention

Wei Mao, Miaomiao Liu, Mathieu Salzmann

Introduction

Human motion prediction consists of forecasting the future poses of a person given a history of their previous motion. Predicting human motion can be highly beneficial for tasks such as human tracking, human-robot interaction, and human motion generation for computer graphics. To tackle the problem effectively, recent approaches use deep neural networks to model the temporal historical data.

Traditional methods, such as hidden Markov models and Gaussian Process Dynamical Models, have proven effective for simple motions, such as walking and golf swings. However, they are typically outperformed by deep learning ones on more complex motions. The most common trend in modeling the sequential data that constitutes human motion consists of using Recurrent Neural Networks (RNNs). However, as discussed in prior work, in the mid- to long-term horizon, RNNs tend to generate static poses because they struggle to keep track of long-term history. To tackle this problem, existing works either rely on Generative Adversarial Networks (GANs), which are notoriously hard to train, or introduce an additional long-term encoder to represent information from the further past. Unfortunately, such an encoder treats the entire motion history equally, thus not allowing the model to put more emphasis on some parts of the past motion that better reflect the context of the current motion.

In this paper, by contrast, we introduce an attention-based motion prediction approach that effectively exploits historical information by dynamically adapting its focus on the previous motions to the current context. Our method is motivated by the observation that humans tend to repeat their motion, not only in short periodical activities, such as walking, but also in more complex actions occurring across longer time periods, such as sports and cooking activities. Therefore, we aim to find the relevant historical information to predict future motion.

To the best of our knowledge, only the work of Tang et al. has attempted to leverage attention for motion prediction. This, however, was achieved in a frame-wise manner, by comparing the human pose from the last observable frame with each one in the historical sequence. As such, this approach fails to reflect the motion direction and is affected by the fact that similar poses may appear in completely different motions. For instance, in most Human3.6M activities, the actor will at some point be standing with their arm resting along their body. To overcome this, we propose to model motion attention, and thus compare the last visible sub-sequence with a history of motion sub-sequences.

To this end, inspired by Mao et al., we represent each sub-sequence in trajectory space using the Discrete Cosine Transform (DCT). We then exploit our motion attention as weights to aggregate the entire DCT-encoded motion history into a future motion estimate. This estimate is combined with the latest observed motion, and the result acts as input to a graph convolutional network (GCN), which lets us better encode spatial dependencies between different joints. As evidenced by our experiments on Human3.6M, AMASS, and 3DPW, and illustrated in Fig. 1, our motion attention-based approach consistently outperforms the state of the art on short-term and long-term motion prediction by training a single unified model for both settings. This contrasts with the previous-best model LTD, which requires training different models for different settings to achieve its best performance. Furthermore, we demonstrate that our approach can effectively leverage the repetitiveness of motion in longer sequences.

Our contributions can be summarized as follows. (i) We introduce an attention-based model that exploits motions instead of static frames to better leverage historical information for motion prediction; (ii) Our motion attention allows us to train a unified model for both short-term and long-term prediction; (iii) Our approach can effectively make use of motion repetitiveness in long-term history; (iv) It yields state-of-the-art results and generalizes better than existing methods across datasets and actions.

Related Work

RNN-based human motion prediction. RNNs have proven highly successful in sequence-to-sequence prediction tasks. As such, they have been widely employed for human motion prediction. For instance, Fragkiadaki et al. proposed an Encoder-Recurrent-Decoder (ERD) model that incorporates a non-linear multi-layer feedforward network to encode and decode motion before and after recurrent layers. To avoid error accumulation, curriculum learning was adopted during training. Jain et al. introduced a Structural-RNN model relying on a manually-designed spatio-temporal graph to encode motion history. The fixed structure of this graph, however, restricts the flexibility of this approach at modeling long-range spatial relationships between different limbs. To improve motion estimation, Martinez et al. proposed a residual-based model that predicts velocities instead of poses. Furthermore, it was shown in this work that a simple zero-velocity baseline, i.e., constantly predicting the last observed pose, led to better performance than the earlier methods. While the residual-based model led to better performance than the previous pose-based methods, the predictions produced by the RNN still suffer from discontinuities between the observed poses and predicted ones. To overcome this, Gui et al. proposed to adopt adversarial training to generate smooth sequences. Ruiz et al. treat human motion prediction as a tensor inpainting problem and exploit a generative adversarial network for long-term prediction. While this approach further improves performance, the use of an adversarial classifier notoriously complicates training, making it challenging to deploy on new datasets.

Feed-forward methods and long motion history encoding. In view of the drawbacks of RNNs, several works considered feed-forward networks as an alternative solution. In particular, Butepage et al. introduced a fully-connected network to process the recent pose history, investigating different strategies to encode temporal historical information via convolutions and exploiting the kinematic tree to encode spatial information. However, as with the fixed graph of the Structural-RNN, the use of a fixed tree structure does not reflect the motion synchronization across different, potentially distant, human body parts. To capture such dependencies, Li et al. built a convolutional sequence-to-sequence model processing a two-dimensional pose matrix whose columns represent the pose at every time step. This model was then used to extract a prior from long-term motion history, which, in conjunction with the more recent motion history, was used as input to an autoregressive network for future pose prediction. While more effective than the RNN-based frameworks, the manually-selected size of the convolutional window highly influences the temporal encoding.

Our work is inspired by that of Mao et al., who showed that encoding the short-term history in frequency space using the DCT, followed by a GCN to encode spatial and temporal connections, led to state-of-the-art performance for human motion prediction up to 1s. As acknowledged by Mao et al., however, encoding long-term history in DCT yields an overly-general motion representation, leading to worse performance than using short-term history. In this paper, we overcome this drawback by introducing a motion-attention-based approach to human motion prediction. This allows us to capture the motion recurrence in the long-term history. Furthermore, in contrast to Li et al., whose encoding of past motions depends on the manually-defined size of the temporal convolution filters, our model dynamically adapts its history-based representation to the context of the current prediction.

Attention models for human motion prediction. While attention-based neural networks are commonly employed for machine translation, their use for human motion prediction remains largely unexplored. The work of Tang et al. constitutes the only exception, incorporating an attention module to summarize the recent pose history, followed by an RNN-based prediction network. This work, however, uses frame-wise pose-based attention, which may lead to ambiguous motion, because static poses do not provide information about the motion direction and similar poses occur in significantly different motions. To overcome this, we propose to leverage motion attention. As evidenced by our experiments, this, combined with a feed-forward prediction network, allows us to outperform the state-of-the-art motion prediction frameworks.

Our Approach

As humans tend to repeat their motion across long time periods, our goal is to discover sub-sequences in the motion history that are similar to the current sub-sequence. In this paper, we propose to achieve this via an attention model.

Following the machine translation formalism, we describe our attention model as a mapping from a query and a set of key-value pairs to an output. The output is a weighted sum of values, where the weight, or attention, assigned to each value is a function of its corresponding key and of the query. Mapped to our motion attention model, the query corresponds to a learned representation of the last observed sub-sequence, and the key-value pairs are treated as a dictionary within which keys are learned representations for historical sub-sequences and values are the corresponding learned future motion representations. Our motion attention model output is defined as the aggregation of these future motion representations based on partial motion similarity between the latest motion sub-sequence and historical sub-sequences.

In our context, we aim to compute attention from short sequences. To this end, we first divide the motion history ${\bf X}_{1:N}=[{\bf x}_{1},{\bf x}_{2},{\bf x}_{3},\cdots,{\bf x}_{N}]$ into $N-M-T+1$ sub-sequences $\{{\bf X}_{i:i+M+T-1}\}_{i=1}^{N-M-T+1}$ , each of which consists of $M+T$ consecutive human poses. By using sub-sequences of length $M+T$ , we assume that the predictor, which we will introduce later, exploits the past $M$ frames to predict the future $T$ frames. We then take the first $M$ poses of each sub-sequence ${\bf X}_{i:i+M-1}$ to be a key, and the whole sub-sequence ${\bf X}_{i:i+M+T-1}$ is then the corresponding value. Furthermore, we define the query as the latest sub-sequence ${\bf X}_{N-M+1:N}$ with length $M$ .
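The splitting described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_history` and the toy integer "poses" are assumptions made for the example, and the code uses 0-based indices whereas the text uses 1-based ones.

```python
def split_history(history, M, T):
    """Split N poses into keys, values and a query.

    keys[i]   holds the first M poses of the i-th sub-sequence,
    values[i] holds all M + T poses of the i-th sub-sequence,
    query     holds the latest M poses of the history.
    """
    N = len(history)
    num_sub = N - M - T + 1  # number of sub-sequences of length M + T
    if num_sub < 1:
        raise ValueError("history must contain at least M + T poses")
    keys = [history[i:i + M] for i in range(num_sub)]
    values = [history[i:i + M + T] for i in range(num_sub)]
    query = history[N - M:]
    return keys, values, query


# Toy history in which pose t is simply the integer t (t = 1..20).
history = list(range(1, 21))
keys, values, query = split_history(history, M=5, T=3)
```

With N = 20, M = 5 and T = 3, this yields 13 key-value pairs, and the query overlaps only with the most recent poses.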

Note that, instead of the softmax function which is commonly used in attention mechanisms, we simply normalize the attention scores by their sum, which we found to avoid the gradient vanishing problem that may occur when using a softmax. While this division only enforces the sum of the attention scores to be $1$ , we further restrict the outputs of $f_{q}$ and $f_{k}$ to be non-negative with ReLU to avoid obtaining negative attention scores.
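As a rough illustration of this normalization, the sketch below scores each key against the query and divides by the sum of the scores. The use of a dot product as the raw score, the small `eps` term, and the toy vectors standing in for the outputs of $f_{q}$ and $f_{k}$ are assumptions of this example, not details taken from the text.

```python
def relu(v):
    """Element-wise ReLU, making every feature non-negative."""
    return [max(0.0, x) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_scores(q, keys, eps=1e-15):
    """Sum-normalized scores: each raw score divided by the sum of all raw scores.

    q and every key are assumed to be non-negative, so every score is >= 0.
    eps only guards against division by zero when all raw scores vanish.
    """
    raw = [dot(q, k) for k in keys]
    total = sum(raw) + eps
    return [r / total for r in raw]


# Toy stand-ins for the learned query and key features (hypothetical values).
q = relu([0.5, -1.0, 2.0])
keys = [relu(k) for k in ([1.0, 0.0, 1.0], [0.0, 3.0, -2.0], [2.0, 1.0, 0.5])]
scores = attention_scores(q, keys)
```

Because the features are non-negative, the scores are non-negative and sum to one without any exponential, which is the property the paragraph above relies on.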

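We then compute the output of the attention model as the weighted sum of values. That is,

$${\bf U}=\sum_{i=1}^{N-M-T+1}a_{i}{\bf V}_{i},$$

where $a_{i}$ denotes the normalized attention score of the $i^{th}$ key and ${\bf V}_{i}$ the DCT coefficients of the corresponding value ${\bf X}_{i:i+M+T-1}$ .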
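The weighted sum of values can be sketched as follows. The helper name `aggregate` and the tiny 2x2 matrices are hypothetical; in practice each value would be the matrix of DCT coefficients of one historical sub-sequence.

```python
def aggregate(scores, values):
    """Return the attention-weighted sum of equally shaped 2D lists."""
    rows, cols = len(values[0]), len(values[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for a, V in zip(scores, values):
        for r in range(rows):
            for c in range(cols):
                out[r][c] += a * V[r][c]
    return out


# Two toy values with attention 0.25 and 0.75.
scores = [0.25, 0.75]
values = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
U = aggregate(scores, values)
```

Since the scores sum to one, the output is a convex combination of the values and stays within their range.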

Prediction Model

To predict the future motion, we use the state-of-the-art motion prediction model LTD. Specifically, as mentioned above, we use a DCT-based representation to encode the temporal information for each joint coordinate or angle and GCNs with learnable adjacency matrices to capture the spatial dependencies among these coordinates or angles.

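Temporal encoding. Given a sequence of $k^{th}$ joint coordinates or angles $\{x_{k,l}\}_{l=1}^{L}$ or its DCT coefficients $\{C_{k,l}\}_{l=1}^{L}$ , the DCT and Inverse-DCT (IDCT) are

$$C_{k,l}=\sqrt{\frac{2}{L}}\sum_{n=1}^{L}x_{k,n}\frac{1}{\sqrt{1+\delta_{l1}}}\cos\left(\frac{\pi}{2L}(2n-1)(l-1)\right),$$

$$x_{k,n}=\sqrt{\frac{2}{L}}\sum_{l=1}^{L}C_{k,l}\frac{1}{\sqrt{1+\delta_{l1}}}\cos\left(\frac{\pi}{2L}(2n-1)(l-1)\right),$$

where $l\in\{1,2,\cdots,L\}$ , $n\in\{1,2,\cdots,L\}$ and $\delta_{ij}=\begin{cases}1&\text{if}\ i=j\\ 0&\text{if}\ i\neq j.\end{cases}$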
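A minimal pure-Python sketch of such a transform pair is given below, assuming the orthonormal DCT-II convention, which is consistent with the index ranges and the Kronecker delta defined above. The function names are illustrative, and the loops keep the 1-based indices $l$ and $n$ so they can be compared with the notation directly.

```python
import math


def dct(x):
    """Orthonormal DCT-II of one trajectory x_1..x_L (1-based l and n)."""
    L = len(x)
    coeffs = []
    for l in range(1, L + 1):
        scale = math.sqrt(2.0 / L) / math.sqrt(2.0 if l == 1 else 1.0)
        coeffs.append(scale * sum(
            x[n - 1] * math.cos(math.pi / (2 * L) * (2 * n - 1) * (l - 1))
            for n in range(1, L + 1)))
    return coeffs


def idct(C):
    """Inverse of dct(): recovers x_1..x_L from C_1..C_L."""
    L = len(C)
    x = []
    for n in range(1, L + 1):
        x.append(math.sqrt(2.0 / L) * sum(
            C[l - 1] / math.sqrt(2.0 if l == 1 else 1.0)
            * math.cos(math.pi / (2 * L) * (2 * n - 1) * (l - 1))
            for l in range(1, L + 1)))
    return x


trajectory = [0.0, 0.2, 0.5, 0.9, 1.4, 2.0]
recovered = idct(dct(trajectory))
constant = dct([1.0, 1.0, 1.0, 1.0])
```

The round trip recovers the trajectory up to floating-point error, and a constant trajectory has all of its energy in the first coefficient, which illustrates why smooth motions are compactly represented in this space.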

Given ${\bf D}$ and ${\bf U}$ , the predictor learns a residual between the DCT coefficients ${\bf D}$ of the padded sequence and those of the true sequence. By applying IDCT to the predicted DCT coefficients, we obtain the coordinates or angles $\hat{{\bf X}}_{N-M+1:N+T}$ , whose last $T$ poses $\hat{{\bf X}}_{N+1:N+T}$ are predictions in the future.
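The residual formulation can be illustrated on a single joint trajectory. The sketch below makes two assumptions that this section does not spell out: the observed sub-sequence is padded by repeating its last pose $T$ times, and the learned predictor is replaced by a stand-in `residual_fn`. All function names are hypothetical.

```python
import math


def dct_basis(L):
    """Orthonormal DCT-II basis B[l][n], written with 0-based indices."""
    return [[math.sqrt((1.0 if l == 0 else 2.0) / L)
             * math.cos(math.pi * (2 * n + 1) * l / (2 * L))
             for n in range(L)] for l in range(L)]


def dct(x):
    return [sum(b * v for b, v in zip(row, x)) for row in dct_basis(len(x))]


def idct(C):
    B = dct_basis(len(C))
    return [sum(B[l][n] * C[l] for l in range(len(C))) for n in range(len(C))]


def predict_future(observed, T, residual_fn):
    """Pad, move to DCT space, add a predicted residual, and return the last T poses."""
    padded = observed + [observed[-1]] * T      # assumed padding scheme
    D = dct(padded)                             # coefficients of the padded sequence
    C_hat = [d + r for d, r in zip(D, residual_fn(D))]
    return idct(C_hat)[-T:]


observed = [0.0, 0.1, 0.3, 0.6, 1.0]
# A predictor that outputs a zero residual simply reproduces the padding.
future = predict_future(observed, T=3, residual_fn=lambda D: [0.0] * len(D))
```

Under these assumptions, a zero residual gives back the last observed pose for every future frame, so the network only has to learn the deviation from that static continuation.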

Training

Let us now introduce the loss functions we use to train our model on either 3D coordinates or joint angles. For 3D joint coordinates prediction, following , we make use of the Mean Per Joint Position Error (MPJPE) proposed in . In particular, for one training sample, this yields the loss

where $\hat{x}_{t,k}$ is the predicted $k^{th}$ angle of the $t^{th}$ pose in $\hat{{\bf X}}_{N-M+1:N+T}$ and $x_{t,k}$ is the corresponding ground truth.
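For illustration, the two kinds of error can be computed as below. This is a sketch under stated assumptions: MPJPE is taken in its usual form (mean Euclidean distance per joint and frame), and the angle-based loss is assumed to be a mean absolute difference; the exact normalization used for training may differ.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error for inputs shaped [frames][joints][3]."""
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        for p, g in zip(pred_frame, gt_frame):
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
            count += 1
    return total / count


def mean_l1(pred, gt):
    """Mean absolute angle difference for inputs shaped [frames][angles] (assumed form)."""
    diffs = [abs(a - b)
             for pred_frame, gt_frame in zip(pred, gt)
             for a, b in zip(pred_frame, gt_frame)]
    return sum(diffs) / len(diffs)


# One frame, two joints: the second joint is off by a 3-4-5 triangle.
position_error = mpjpe([[[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]],
                       [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]])
angle_error = mean_l1([[1.0, 2.0]], [[0.0, 4.0]])
```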

Network Structure

As shown in Fig. 2, our complete framework consists of two modules: a motion attention model and a predictor. For the attention model, we use the same architecture for $f_{q}$ and $f_{k}$ . Specifically, we use a network consisting of two 1D convolutional layers, each of which is followed by a ReLU activation function. In our experiments, the kernel size of these two layers is 6 and 5, respectively, to obtain a receptive field of 10 frames. The dimension of the hidden features, the query vector q and the key vectors $\{\textbf{k}_{i}\}_{i=1}^{N-M-T+1}$ is set to 256.
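The receptive field quoted above follows from simple arithmetic, sketched here under the assumption that both convolutions use stride 1, no dilation and no padding (the text does not state these settings).

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated 1D convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


def output_length(num_frames, kernel_sizes):
    """Temporal length after applying the unpadded convolutions in order."""
    for k in kernel_sizes:
        num_frames = num_frames - k + 1
    return num_frames


field = receptive_field([6, 5])
length = output_length(10, [6, 5])
```

With kernel sizes 6 and 5, each output position sees 10 input frames, so a 10-frame sub-sequence collapses to a single feature vector under these assumptions.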

For the predictor, we use the same GCN with residual structure as in LTD. It is made of 12 residual blocks, each of which contains two graph convolutional layers, with an additional initial layer to map the DCT coefficients to features and a final layer to decode the features to DCT residuals. The learnable weight matrix W of each layer is of size $256\times 256$ , and the size of the learnable adjacency matrix A depends on the dimension of one human pose. For example, for 3D coordinates, A is of size $66\times 66$ . Thanks to the simple structure of our attention model, the overall network remains compact. Specifically, in our experiments, it has around 3.4 million parameters for both 3D coordinates and angles. The implementation details are included in the supplementary material.
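To make the roles of A and W concrete, here is a minimal sketch of the linear part of one graph convolutional layer, assuming the common form in which the adjacency matrix mixes nodes and the weight matrix mixes features. Biases, normalization and activations are omitted, and the tiny matrices are purely illustrative.

```python
def matmul(X, Y):
    """Plain matrix product of two 2D lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def graph_conv(A, H, W):
    """Linear part of a graph convolution: A (K x K) mixes nodes, W (F_in x F_out) mixes features."""
    return matmul(matmul(A, H), W)


# Toy case with K = 2 nodes and 2 features per node.
A = [[0.5, 0.5],
     [0.0, 1.0]]   # learnable adjacency (66 x 66 for 3D coordinates in the text)
H = [[1.0, 2.0],
     [3.0, 4.0]]   # one feature row per node
W = [[1.0, 0.0],
     [0.0, 2.0]]   # learnable weights (256 x 256 in the text)
out = graph_conv(A, H, W)
```

Because A is learned rather than fixed by the kinematic tree, any pair of coordinates can exchange information within a single layer.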

Experiments

Following previous works, we evaluate our method on Human3.6M (H3.6M) and AMASS. We further evaluate our method on 3DPW using our model trained on AMASS to demonstrate the generalizability of our approach. Below, we discuss these datasets, the evaluation metrics and the baseline methods, and present our results using joint angles and 3D coordinates.

Human3.6M is the most widely used benchmark dataset for motion prediction. It depicts seven actors performing 15 actions. Each human pose is represented as a 32-joint skeleton. We compute the 3D coordinates of the joints by applying forward kinematics on a standard skeleton. Following previous work, we remove the global rotation, translation and constant angles or 3D coordinates of each human pose, and down-sample the motion sequences to 25 frames per second. As in previous work, we test our method on subject 5 (S5). However, instead of testing on only 8 random sub-sequences per action, which was shown to lead to high variance, we report our results on 256 sub-sequences per action when using 3D coordinates. For fair comparison, we report our angular error on the same 8 sub-sequences used in previous work. Nonetheless, we provide the angle-based results on 256 sub-sequences per action in the supplementary material.

AMASS. The Archive of Motion Capture as Surface Shapes (AMASS) dataset is a recently published human motion dataset, which unifies many mocap datasets, such as CMU, KIT and BMLrub, using a SMPL parameterization to obtain a human mesh. SMPL represents a human by a shape vector and joint rotation angles. The shape vector, which encompasses coefficients of different human shape bases, defines the human skeleton. We obtain human poses in 3D by applying forward kinematics to one human skeleton. In AMASS, a human pose is represented by 52 joints, including 22 body joints and 30 hand joints. Since we focus on predicting human body motion, we discard the hand joints and the 4 static joints, leading to an 18-joint human pose. As for H3.6M, we down-sample the frame-rate to 25Hz.

Since most sequences of the official testing split of AMASS (described at https://github.com/nghorbani/amass) consist of transitions between two unrelated actions, such as dancing to kicking or kicking to pushing, they are not suitable to evaluate our prediction algorithms, which assume that the history is relevant to forecast the future. Therefore, instead of using this official split, we treat BMLrub (available at https://amass.is.tue.mpg.de/dataset; 522 min. of video sequences) as our test set, as each of its sequences consists of one actor performing one type of action. We then split the remaining parts of AMASS into training and validation data.

3DPW. The 3D Pose in the Wild dataset (3DPW) consists of challenging indoor and outdoor actions. We only evaluate our model trained on AMASS on the test set of 3DPW to show the generalization of our approach.

Evaluation Metrics and Baselines

Metrics. For the models that output 3D positions, we report the Mean Per Joint Position Error (MPJPE) in millimeters, which is commonly used in human pose estimation. For those that predict angles, we follow the standard evaluation protocol and report the Euclidean distance in Euler angle representation.

Baselines. We compare our approach with two RNN-based methods, Res. sup. and MHU, and two feed-forward models, convSeq2Seq and LTD, the latter of which constitutes the state of the art. The angular results of Res. sup., convSeq2Seq and MHU on H3.6M are directly taken from the respective papers. For the other results of Res. sup. and convSeq2Seq, we adapt the code provided by the authors for H3.6M to 3D and AMASS. For LTD, we rely on the pre-trained models released by the authors for H3.6M, and train their model on AMASS using their official code. While Res. sup., convSeq2Seq and MHU are all trained to generate 25 future frames, LTD has 3 different models, which we refer to as LTD-50-25, LTD-10-25, and LTD-10-10. The two numbers after the method name indicate the number of observed past frames and that of future frames to predict, respectively, during training. For example, LTD-10-25 means that the model is trained to take the past 10 frames as input to predict the future 25 frames.

Results

Following the setting of our baselines, we report results for short-term ( $<500ms$ ) and long-term ( $>500ms$ ) prediction. On H3.6M, our model is trained using the past $50$ frames to predict the future $10$ frames, and we produce poses further in the future by recursively applying the predictions as input to the model. On AMASS, our model is trained using the past $50$ frames to predict the future $25$ frames.
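The recursive use of a short-horizon model can be sketched as a simple rollout loop. The `rollout` helper and the zero-velocity stand-in predictor below are hypothetical; any function that maps the latest window of poses to the next $T$ poses could take its place.

```python
def rollout(predict, history, T, horizon):
    """Predict `horizon` frames by repeatedly feeding T-frame predictions back in."""
    window = len(history)           # number of past frames the model consumes
    frames = list(history)
    future = []
    while len(future) < horizon:
        step = predict(frames[-window:])
        if len(step) != T:
            raise ValueError("predictor must return exactly T poses")
        future.extend(step)
        frames.extend(step)         # predictions become part of the input history
    return future[:horizon]


calls = []


def zero_velocity(past, T=10):
    """Stand-in predictor that repeats the last observed pose T times."""
    calls.append(len(past))
    return [past[-1]] * T


history = list(range(50))           # 50 toy poses
prediction = rollout(zero_velocity, history, T=10, horizon=25)
```

Reaching 25 frames with a 10-frame predictor takes three calls here, and each call sees a window that already contains earlier predictions, which is why errors can accumulate over long horizons.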

Human3.6M. In Tables 1 and 2, we provide the H3.6M results for short-term and long-term prediction in 3D space, respectively. Note that we outperform all the baselines on average for both short-term and long-term prediction. In particular, our method yields larger improvements on activities with a clear repeated history, such as “Walking” and “Walking Together”. Nevertheless, our approach remains competitive on the other actions. Note that we consistently outperform LTD-50-25, which is trained on the same number of past frames as our approach. This, we believe, evidences the benefits of exploiting attention on the motion history.

Let us now focus on the LTD baseline, which constitutes the state of the art. Although LTD-10-10 is very competitive for short-term prediction, when it comes to generating poses in the further future, it yields higher average error, i.e., $114.0mm$ at $1000ms$ . By contrast, LTD-10-25 and LTD-50-25 achieve good performance at $880ms$ and above, but perform worse than LTD-10-10 at other time horizons. Our approach, however, yields state-of-the-art performance for both short-term and long-term predictions. To summarize, our motion attention model improves the performance of the predictor for short-term prediction and further enables it to generate better long-term predictions. This is further evidenced by Tables 3 and 4, where we report the short-term and long-term prediction results in angle space on H3.6M, and by the qualitative comparison in Fig. 3. More qualitative results are provided in the supplementary material.

AMASS & 3DPW. The results of short-term and long-term prediction in 3D on AMASS and 3DPW are shown in Table 5. Our method consistently outperforms baseline approaches, which further evidences the benefits of our motion attention model. Since none of the methods were trained on 3DPW, these results further demonstrate that our approach generalizes better to new datasets than the baselines.

Visualisation of attention. In Fig. 4, we visualize the attention maps computed by our motion attention model on a few sampled joints for their corresponding coordinate trajectories. In particular, we show attention maps for joints in a periodical motion (“Walking”) and a non-periodical one (“Discussion”). In both cases, the attention model can find the most relevant sub-sequences in the history, which encode either a nearly identical motion (periodical action), or a similar pattern (non-periodical action).

Motion repeats itself in longer-term history. Our model, which is trained with fixed-length observations, can nonetheless exploit longer history at test time if it is available. To evaluate this and our model’s ability to capture long-range motion dependencies, we manually sampled $100$ sequences from the test set of H3.6M, in which similar motion occurs in the further past than that used to train our model.

In Table 6, we compare the results of a model trained with 50 past frames and using either $50$ frames (Ours-50) or 100 frames (Ours-100) at test time. Although the performance is close in the very short term ( $<160ms$ ), the benefits of using a longer history become obvious for predictions further in the future, leading to a performance boost of $4.2mm$ at $1s$ . In Fig. 5, we compare the attention maps and predicted joint trajectories of Ours-50 (a) and Ours-100 (b). The highlighted regions (red boxes) in the attention map demonstrate that our model can capture the repeated motions in the further history if it is available at test time, and thereby improve the motion prediction results.

To show the influence of further historical frames, we replace the past $40$ frames with a static pose, thus removing the motion in that period, and then perform prediction with this sequence. As shown in Fig. 5 (c), attending to the similar motion between frames $-80$ and $-60$ yields a trajectory much closer to the ground truth than only attending to the past $50$ frames.

Conclusion

In this paper, we have introduced an attention-based motion prediction approach that selectively exploits historical information according to the similarity between the current motion context and the sub-sequences in the past. This has led to a predictor equipped with a motion attention model that can effectively make use of historical motions, even when they are far in the past. Our approach achieves state-of-the-art performance on the commonly-used motion prediction benchmarks and on recently-published datasets. Furthermore, our experiments have demonstrated that our network generalizes to previously-unseen datasets without re-training or fine-tuning, and can handle a longer history than the one it was trained with to further boost performance on non-periodical motions with repeated history. In the future, we will further investigate the use of our motion attention mechanism to discover human motion patterns at the level of body parts, such as legs and arms, to obtain more flexible attention, and explore new prediction frameworks.

Acknowledgements

This research was supported in part by the Australian Research Council DECRA Fellowship (DE180100628) and ARC Discovery Grant (DP200102274). The authors would like to thank NVIDIA for the donated GPU (Titan V).

History Repeats Itself: Human Motion Prediction via Motion Attention (Supplementary Material)

Datasets

Below we provide more details about the datasets used in our experiments.

Human3.6M. As in previous work, we use the skeleton of subject 1 (S1) of Human3.6M as the standard skeleton to compute the 3D joint coordinates from the joint angle representation. After removing the global rotation, translation and constant angles or 3D coordinates of each human pose, this leaves us with a 48-dimensional vector and a 66-dimensional vector for the human pose in angle representation and 3D position, respectively. The rotation angles are represented as exponential maps. During training, we set aside subject 11 (S11) as our validation set to choose the model that achieves the best performance across all future frames, and the remaining 5 subjects (S1, S6, S7, S8, S9) are used as the training set.

AMASS & 3DPW. The human skeleton in AMASS and 3DPW is defined by a shape vector. In our experiments, we obtain the 3D joint positions by applying forward kinematics on the skeleton derived from the shape vector of the CMU dataset.

As specified in the main paper, we evaluate the model on BMLrub and 3DPW. Each video sequence is first down-sampled to 25 frames per second, and we then evaluate on sub-sequences of length $M+T$ that start from every $5^{th}$ frame of each video sequence.
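The sampling of evaluation windows can be sketched as follows. The helper names are hypothetical, the down-sampling factor depends on the source frame rate (which is not given here and is therefore left as a parameter), and start indices are 0-based.

```python
def downsample(frames, factor):
    """Keep every `factor`-th frame; the factor depends on the original frame rate."""
    return frames[::factor]


def eval_windows(num_frames, M, T, stride=5):
    """0-based start indices of length-(M + T) windows taken every `stride` frames."""
    length = M + T
    return list(range(0, num_frames - length + 1, stride))


frames = downsample(list(range(200)), factor=2)   # e.g. a 50 Hz clip down to 25 Hz
starts = eval_windows(len(frames), M=50, T=25)
```

For a 100-frame clip with $M+T=75$, this gives six overlapping windows, the last of which ends exactly on the final frame.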

Implementation Details

We implemented our network in PyTorch and trained it using the Adam optimizer. We use a learning rate of $0.0005$ with a decay at every epoch so that the learning rate reaches $0.00005$ at the $50^{th}$ epoch. We train our model for $50$ epochs with a batch size of 32 for H3.6M and 128 for AMASS. One forward and backward pass takes 32ms for H3.6M and 45ms for AMASS on an NVIDIA Titan V GPU.
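One schedule that satisfies the two stated endpoints is an exponential decay with a constant per-epoch factor. The sketch below assumes this form and assumes that the final value is reached after 50 decay steps; only the start value, the end value and the per-epoch decay come from the text.

```python
def lr_at_epoch(epoch, lr_start=0.0005, lr_end=0.00005, num_epochs=50):
    """Learning rate after `epoch` decay steps under a constant multiplicative decay."""
    gamma = (lr_end / lr_start) ** (1.0 / num_epochs)   # per-epoch decay factor
    return lr_start * gamma ** epoch


schedule = [lr_at_epoch(e) for e in range(51)]
```

Under this assumption the learning rate shrinks by roughly 4.5% per epoch, i.e. by a factor of ten over the whole run.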

Additional Results on H3.6M

In Tables 1 and 2, we report the Human3.6M results in angle representation for short-term and long-term prediction, respectively. Here, we average the error over 256 random sub-sequences per action, which was shown to be more stable than averaging over 8 random sub-sequences per action as is commonly done. Our conclusions remain unchanged: our approach achieves state-of-the-art performance for both short-term and long-term prediction on average.

Generating Long Future for Periodical Motions

For periodical motions, such as “Walking”, our approach can generate very long futures (up to 16 seconds). As shown in the supplementary video, such future predictions are hard to distinguish from the ground truth even for humans.

Additional Results on AMASS

In Fig. 1, we compare the results of LTD and of our approach on the BMLrub dataset. Our results better match the ground truth.

Motion Attention vs. Frame-wise Attention

To further investigate the influence of motion attention, where the attention on the history sub-sequences $\{\textbf{X}_{i:i+M+T-1}\}_{i=1}^{N-M-T+1}$ is a function of the first $M$ poses of every sub-sequence $\{\textbf{X}_{i:i+M-1}\}_{i=1}^{N-M-T+1}$ (keys) and the last observed $M$ poses $\textbf{X}_{N-M+1:N}$ (query), we replace the keys and query with the last frame of each sub-sequence. That is, we use $\{\textbf{X}_{i+M-1}\}_{i=1}^{N-M-T+1}$ as keys and $\textbf{X}_{N}$ as query. We refer to the resulting method as Frame-wise Attention. As shown in Table 3, motion attention outperforms frame-wise attention by a large margin. As discussed in the main paper, this is due to frame-wise attention not considering the direction of the motion, leading to ambiguities.
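The ambiguity of frame-wise keys can be seen on a toy one-dimensional trajectory. In the sketch below, exact equality stands in for similarity, and all function names are hypothetical; indices are 0-based.

```python
def motion_keys(history, M, T):
    """Keys are the first M poses of each sub-sequence; the query is the last M poses."""
    n = len(history) - M - T + 1
    return [history[i:i + M] for i in range(n)], history[-M:]


def framewise_keys(history, M, T):
    """Keys are the last frame of each M-pose key; the query is the last pose."""
    n = len(history) - M - T + 1
    return [history[i + M - 1] for i in range(n)], history[-1]


def matches(keys, query):
    """Indices of keys identical to the query (a crude stand-in for high attention)."""
    return [i for i, k in enumerate(keys) if k == query]


# A joint coordinate that rises, falls, then starts rising again.
history = [0, 1, 2, 1, 0, 1]
motion_hits = matches(*motion_keys(history, M=2, T=1))
frame_hits = matches(*framewise_keys(history, M=2, T=1))
```

The frame-wise query (pose 1) matches both a rising and a falling segment, whereas the motion query (0 then 1) matches only the rising one, so only the motion-based keys point to a history whose continuation is consistent with the current direction.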