Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning

Xinshun Wang, Zhongbin Fang, Xia Li, Xiangtai Li, Mengyuan Liu

Introduction

The idea of multi-task models has attracted much research interest over the years in a wide range of areas, such as computer vision and natural language processing (NLP). To build such models, a prominent line of work deploys task-specific heads on a shared backbone and learns them through pretraining and finetuning. Models with such task-specific architectures are not learned in an end-to-end manner and are hard to generalize to new, unseen tasks and data. On skeleton-based human-centric tasks, most existing methods rely on task-specific designs and/or non-end-to-end frameworks to accomplish multiple tasks. Several works learn unified human motion representations. MotionBERT adopts two-stage training, including pre-training on the 2D-3D lifting task and fine-tuning on each task. UniHCP connects a well-designed shared backbone network to specific task heads according to different tasks. UPS unifies action labels and motion sequences into text-based language sequences and fine-tunes on each task for better performance. These models handle only a limited scope of tasks and lack flexibility, which limits their capability to generalize to unseen tasks and data. To overcome these challenges, in-context learning, which was originally proposed in NLP and then introduced into other areas such as images and point clouds, represents a new trend for building generalist models. Some generalist models utilize the Masked Image Modeling (MIM) framework to exhibit multi-task capability via in-context learning in image processing, as shown in Fig. 1(a). Similarly, as shown in Fig. 1(b), Point-In-Context proposes a key joint sampling module and introduces in-context learning into point cloud processing. Specifically, in-context models learn patterns from the provided input-output pairs, called in-context examples or prompts, and then perform the desired task according to the prompt.
An in-context model can accomplish diverse tasks end-to-end without any task-specific heads or fine-tuning. It also has excellent generalization capability. Directly adapting existing in-context models from other areas onto skeleton sequences fails because they largely focus on static objects (images or point clouds) and thus cannot capture spatial-temporal dependencies in skeleton sequences. To the best of our knowledge, in-context learning remains unexplored in 3D skeleton sequence modeling.

In this paper, we propose Skeleton-in-Context (SiC), a novel framework for skeleton sequence modeling, which is able to process multiple tasks simultaneously after a single training process without any task-specific designs, as Fig. 1(c) shows. Specifically, SiC perceives tasks from context, i.e., the prompt from a desired task provided as additional input, and then accomplishes the task for the query. SiC employs two types of prompts to facilitate context perception: the task-guided prompt (TGP) and the task-unified prompt (TUP). TGP provides a task template for the model, giving the model the ability to make analogies. This is the key to the model’s capability to generalize to new tasks, as we can guide the model to do unseen tasks by customizing specific prompts. TUP can be implemented in two ways: static and dynamic. The static TUP leverages prior knowledge shared among tasks, while the dynamic TUP adaptively learns to perceive task context. Skeleton-in-Context provides a new perspective for training a skeleton-based multi-task in-context model. Since there is no existing benchmark for measuring such a model, we establish a new in-context benchmark based on four tasks, including three skeleton-based tasks: motion prediction, pose estimation, and joint completion, and a new task: future pose estimation.

Our main contributions are as follows: 1) We introduce a novel framework, Skeleton-in-Context (SiC), designed to handle multiple skeleton-based tasks simultaneously. Our approach learns the underlying pattern of the given prompt and applies it to the query sample. 2) We establish a new 3D human in-context learning benchmark comprising multiple skeleton-based tasks, namely motion prediction, pose estimation, joint completion, and future pose estimation. We further benchmark several representative in-context learning baselines for reference. 3) We conduct extensive experiments to evaluate our SiC. SiC achieves state-of-the-art results in multi-tasking and even outperforms task-specific models on certain tasks. Moreover, SiC also generalizes well to unseen tasks such as motion in-between.

Related Work

Skeleton-Based Motion Analysis. Many skeleton-based human-centric tasks have followed similar development patterns in terms of the methods used to address them. On human motion prediction, early successes were achieved with RNNs and CNNs. Nowadays, the task is dominated by Graph Convolutional Networks (GCNs). LTD considers each joint coordinate as a node in a graph on which to employ learnable spatial graph convolutions. Follow-up works focus on how to assist graph convolutions with additional human structural information or patterns. Recent works combine spatial and temporal graph convolutions to learn spatial-temporal dependencies. Different from motion prediction, the goal of motion synthesis is to complete the human motion sequence. To complete the missing parts of human motion, many works adopt convolutional models or generative adversarial networks. One category of 3D pose estimation lifts the estimated 2D pose to 3D, often employing spatial-temporal convolution or transformer-based methods. However, these approaches focus on designing customized architectures specific to a task, which adapt to only a limited range of applications. To this end, we propose to unify the input and output formats of multiple tasks and utilize a single model to process multiple tasks simultaneously via in-context learning.

In-Context Learning. The emergence of in-context learning, which originated in Natural Language Processing (NLP), provides a new paradigm for achieving multi-tasking. It endows models with the capability to handle multiple tasks and even unseen tasks by incorporating domain-specific input-target pairs, known as prompts, into the query sample. Following NLP, some works extend in-context learning to computer vision, covering 2D images and 3D point clouds. Specifically, Visual Prompt, Painter, and Point-In-Context are pioneers of in-context methods in computer vision. They verify the effectiveness and flexibility of in-context learning in vision-based tasks, and most of them adopt pure-transformer-based approaches. Meanwhile, several approaches explore prompt engineering and generalization in in-context learning in depth. Recently, LVM unifies most vision tasks and data via joint co-training on one visual in-context model. Compared to the works focused on static objects (images and point clouds), our work mainly delves into in-context modeling of dynamic skeleton sequences.

Unified Skeleton-Centric Models. A generalist model that can excel in handling multiple tasks is a profound step toward general artificial intelligence. Previous works explore co-training on several human-centric tasks. Meanwhile, some works show great progress in simultaneously predicting parsing and pose. Recent research demonstrates the effectiveness of modeling unified human motion sequences. In particular, UniHCP designs specific task heads according to different 2D human-centric tasks. MotionBERT is pre-trained on the 2D-3D lifting task and fine-tuned on each task. UPS unifies heterogeneous human behavior understanding tasks via language sequences and cyclic training over each task in order. Nevertheless, these works are limited by complex training stages or task heads, which introduce extra costs. We unify the output formats of different tasks into a skeleton sequence format and train the model only once.

In-Context Skeleton Sequence Modeling

where $k$ represents the task index while $i$ and $j$ indicate different example indexes. Thus, we can determine what task will be performed on the query sample by selecting the task-specific prompt from the corresponding training set.

2 Task Definitions

To measure multitasking capability, we establish a new in-context benchmark from common skeleton-based datasets including H3.6M, AMASS, and 3DPW. In this benchmark, we select three common tasks: motion prediction, pose estimation, and joint completion, and a new task: future pose estimation. We conduct extensive experiments on this large and diverse dataset collection, which consists of 176K skeleton sequence samples.

Motion Prediction. The goal of motion prediction is to predict future motion sequences based on historical motion sequences. There is continuity between the input and output sequences, which constitute a complete motion. Thus, the input $I$ and target $T$ are complete skeleton sequences with shapes of $(F,J,3)$.

Joint Completion. As a new task, the input of joint completion is a skeleton sequence in which some joints, discarded at random, are missing. It is common for several joints to be missing or displaced when collecting data with Lidar cameras. To narrow the gap with skeleton sequences collected in practice, we imitate this situation: we take a skeleton sequence with missing joints as input and complete the missing joints. For the displacement problem, we can mask the incorrect joints and complete them. We evaluate joint completion at two levels, with 40$\%$ and 60$\%$ of the joints missing in each frame. The coordinates of missing joints are replaced with 0 so that the input keeps the unified skeleton sequence format $(F,J,3)$.
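The masking step described above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation: the function name `mask_joints`, the choice of $J=17$ joints, the rounding of the number of missing joints, and the redrawing of the missing subset independently for every frame are all assumptions made for the example.

```python
import random


def mask_joints(sequence, ratio, rng):
    """Zero out a random subset of joints in every frame.

    sequence: list of F frames, each a list of J [x, y, z] joints.
    ratio:    fraction of joints to drop per frame (e.g. 0.4 or 0.6).
    rng:      a random.Random instance, so the example is reproducible.
    """
    masked = []
    for frame in sequence:
        num_joints = len(frame)
        num_missing = round(ratio * num_joints)  # assumed rounding rule
        missing = set(rng.sample(range(num_joints), num_missing))
        masked.append([
            [0.0, 0.0, 0.0] if j in missing else list(joint)
            for j, joint in enumerate(frame)
        ])
    return masked


rng = random.Random(0)
F, J = 16, 17  # F follows the paper; J = 17 is an assumption for this sketch
sequence = [[[1.0, 2.0, 3.0] for _ in range(J)] for _ in range(F)]
masked_40 = mask_joints(sequence, 0.4, rng)
masked_60 = mask_joints(sequence, 0.6, rng)
```

Because missing joints are overwritten with zeros rather than removed, the masked sequence keeps the $(F,J,3)$ shape and can be fed to the same model as every other task.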

Skeleton-in-Context

In-context learning on other data modalities such as images and point clouds largely employs Masked Image Modeling (MIM) to reconstruct the randomly masked elements. During inference, only the target elements are masked. On skeleton sequences, MIM fails due to spatial-temporal complexity and inter-pose similarity, which make it especially hard to perceive the task correctly from a subtle context. In our Skeleton-in-Context, we employ a new framework to address this challenge.

Task-Guided Prompt. In order to endow the model with the ability to handle multiple tasks simultaneously, a prompt is needed to instruct the model on what task should be conducted on the Query sample. This is exactly the meaning of in-context learning: learning the underlying pattern hidden in the Prompt from the contextual input. Here, the pattern relating the input and output of the Prompt is the task pattern. As mentioned in Sec. 3.1, we propose the Task-Guided Prompt (TGP) $G^{k}_{i}=\left\{I^{k}_{i},T^{k}_{i}\right\}$, which comprises an example skeleton pair performing a specific task $k$. In this way, our proposed model is able to perform a variety of tasks according to the provided prompts, whether or not they are seen in the training set. It is worth noting that our model can perform unseen tasks and transfer to datasets whose styles differ from the training dataset. This will be elaborated in Sec. 5.2.

Avoiding Over-Fitting. Previous works have made great progress in unifying different tasks via the MIM-style framework. Compared with static objects in 2D and 3D, such as images and point clouds, the obstacle to in-context learning on skeleton sequences lies in the high similarity between frames. Directly using the random masking strategy therefore causes the model to over-fit: during training it learns only to interpolate frames rather than to capture meaningful motion sequence representations. As shown in Fig. 2, when we directly utilize the MPM-style framework to train our model, it can perfectly reconstruct the masked frame during training, yet the experimental results reveal a severely degraded output during inference. The output skeleton sequence gradually shrinks as the length increases because the frame following the current one is 0, and interpolating toward 0 leads to shrinking. This phenomenon underscores the challenges of learning contextual information about dynamic skeleton sequences.

Task-Unified Prompt. While the Task-Guided Prompt exhibits unique capabilities specific to in-domain or out-of-domain tasks, mitigating the gap between different tasks brings a new challenge. Meanwhile, as discussed above, a straightforward extension of previous works leads to a simple model that can only perform interpolation on the given skeleton sequence. To tackle these challenges, we introduce the Task-Unified Prompt (TUP), a task-agnostic module that adaptively learns to incorporate multiple tasks into the framework of in-context learning, i.e., a one-off end-to-end framework for all tasks without having to deploy task-specific heads or fine-tuning. We introduce two versions of TUP:

Version 1: Static-TUP with static pseudo pose:

where $\mathcal{E}\left(\cdot\right)$ is an embedding layer. $M$ and $K$ indicate the total number of training samples and tasks, respectively.

Version 2: Dynamic-TUP with dynamic pseudo pose:

The SiC based on Static-/Dynamic-TUP is referred to as Static-/Dynamic-SiC, respectively. The proposed TUP facilitates the modeling of distribution differences among tasks and datasets, so the model focuses more on learning unified representations shared across tasks.

2 Model Architecture

As shown in Fig. 3, given a prompt pair $G^{k}_{i}=\left\{I^{k}_{i},T^{k}_{i}\right\}$ and a query sample $Q^{k}_{j}$, we first map them to the $C$-dimensional feature space, and then concatenate $I$ with $T$ and $Q$ with $U$, respectively. The former pair represents our proposed Task-Guided Prompt, and the latter is the sequence on which the task is to be performed. The process can be formulated as:

where $\mathcal{E}\left(\cdot\right)$ is a linear transformation and $\parallel$ is the concatenation operator. In essence, $U$ represents the prior knowledge of the target skeleton sequence and also acts as an adapter that adjusts to different tasks and different data distributions during learning.
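To make the shapes in this step concrete, the following plain-Python sketch embeds each joint with a linear map and concatenates prompt and query pairs. It is an illustration under stated assumptions rather than the authors' code: the helper names (`embed`, `concat_frames`, `random_sequence`), concatenation along the frame axis, a pseudo pose $U$ that already lives in the $C$-dimensional feature space, and the toy sizes ($J=17$, $C=8$ instead of 256) are all chosen for the example.

```python
import random


def embed(sequence, weight, bias):
    """Linear map applied to every joint: 3 coordinates -> C features."""
    return [[[sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weight, bias)]
             for joint in frame]
            for frame in sequence]


def concat_frames(first, second):
    """Concatenate two (F, J, C) sequences along the frame axis -> (2F, J, C)."""
    return first + second


rng = random.Random(0)
F, J, C = 16, 17, 8  # C is 256 in the paper; kept small for the sketch


def random_sequence(channels):
    """Random (F, J, channels) nested list standing in for real data."""
    return [[[rng.uniform(-1, 1) for _ in range(channels)]
             for _ in range(J)] for _ in range(F)]


weight = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(C)]
bias = [0.0] * C

prompt_input, prompt_target, query_input = (random_sequence(3) for _ in range(3))
pseudo_target = random_sequence(C)  # stand-in for U, assumed already embedded

G_tilde = concat_frames(embed(prompt_input, weight, bias),
                        embed(prompt_target, weight, bias))
Q_tilde = concat_frames(embed(query_input, weight, bias), pseudo_target)
```

Under these assumptions both `G_tilde` and `Q_tilde` have shape $(2F, J, C)$, with the target slot occupying the second half of the frames.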

Drawing inspiration from prior work, we connect spatial attention blocks and temporal attention blocks in an alternating manner to form a two-stream transformer $\mathcal{M}\left(\cdot\right)$ dedicated to capturing the dynamic characteristics of motion, which is implemented as:

where $\alpha$ and $\beta$ are balancing parameters determined adaptively. $\mathcal{S}_{1}^{i}$ and $\mathcal{T}_{1}^{i}$ are implemented using a spatial self-attention mechanism and a temporal self-attention mechanism, respectively. Additionally, $i=1,\dots,N$.

Then, $\mathcal{M}\left(\cdot\right)$ takes $\widetilde{G}$ and $\widetilde{Q}$ as input and processes them in parallel, iterating over each $n_{1}$ times. The resulting $\widetilde{G}^{n_{1}}$ and $\widetilde{Q}^{n_{1}}$ are then integrated via an aggregation function $\mathcal{A}\left(\cdot\right)$:

We consider $\mathcal{A}\left(\cdot\right)$ to be a simple weighted summation function. The final output is obtained after passing $W$ through $\mathcal{M}\left(\cdot\right)$ for a further $n_{2}$ iterations, where $n_{1}+n_{2}=N$. Note that we take the second half of the output as the prediction.
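The aggregation and the final slicing can be illustrated with a small sketch. This is a toy example under assumptions, not the paper's implementation: the weights of 0.5 are placeholders (the text only states that $\mathcal{A}$ is a weighted sum), the function names are hypothetical, and the $n_{2}$ further transformer iterations between aggregation and slicing are omitted.

```python
def weighted_sum(a, b, weight_a, weight_b):
    """Element-wise weight_a * a + weight_b * b over equally shaped nested lists."""
    if isinstance(a, list):
        return [weighted_sum(x, y, weight_a, weight_b) for x, y in zip(a, b)]
    return weight_a * a + weight_b * b


def second_half(sequence):
    """Keep the trailing half of the frames, i.e. the target slot."""
    return sequence[len(sequence) // 2:]


# Toy features of shape (2F, J, C) = (4, 1, 2)
prompt_features = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]], [[4.0, 4.0]]]
query_features = [[[0.0, 2.0]], [[0.0, 2.0]], [[0.0, 2.0]], [[0.0, 2.0]]]

W = weighted_sum(prompt_features, query_features, 0.5, 0.5)
prediction_slot = second_half(W)
```

Taking the second half matches a layout in which the first $F$ frames hold the input and the last $F$ frames hold the target, which is why only the trailing frames are read out as the prediction.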

3 Training and Inference

Training. We train our model on the skeleton-centric in-context dataset mentioned in Sec. 3. Trained this way, our Skeleton-in-Context can perform multiple tasks after a single training run. Since our output format is skeleton-like, we follow previous work and adopt the 3D joint-level reconstruction loss:

Inference. After a training phase involving all tasks, we get an in-context model that focuses on dynamic skeleton sequences. It can perform multiple tasks at the same time according to the provided task-dependent prompt without any further parameter updates.

Experiments

Implementation Details. We implement the proposed Skeleton-in-Context model with the number of layers $N$ = 5, number of attention heads $H$ = 8, and hidden feature dimension $C$ = 256. For each prompt input/target and query input/target, the sequence length is $F$ = 16. All tasks are unified into a one-off, end-to-end training process, without any task-specific designs. The training takes about 6 hours for 120 epochs on 4 NVIDIA GeForce RTX 4090 GPUs.

Datasets and Metrics. H3.6M (Human3.6M) is used for the pose estimation and future pose estimation tasks. AMASS is used for the motion prediction task. 3DPW (3D Pose in the Wild) is used for the joint completion task. All of the tasks can be evaluated using Mean Per Joint Position Error (MPJPE).

In order to make a qualitative comparison with other methods in a relatively fair situation, we reimplement the task-specific SoTA methods in our proposed benchmark. Additionally, we transform recent methods into multi-task models and compare them with our model.

Copy Method. Following prior work, we copy the prompt target $T_{i}^{k}$ as the prediction result for comparison.

Multi-Task-Capable Models. Task-specific models and multi-stage models are not able to handle multiple tasks at the same time. Therefore, we re-model them into end-to-end multi-task-capable models to achieve a fair comparison.

In-Context Models. Since no existing in-context model excels at processing multiple human perception tasks, we select the in-context models most relevant to our benchmark, Point-In-Context and PointMAE, which were proposed for 3D point cloud understanding. Specifically, we take each frame as a token in their framework. For PointMAE, we use the transformed version from prior work.

2 Compared with SoTA Methods

Quantitative Results. We report experimental results on four tasks in Tab. 1. We term Skeleton-in-Context implemented with static/dynamic-TUP as Static/Dynamic-SiC, respectively. For other models, we reproduce them and report the results. As can be seen from Tab. 1, our proposed model outperforms all multi-task models, achieving state-of-the-art results on the in-context multi-task benchmark. Notably, the results of task-specific models on a single task reflect the difficulty of each task. However, we also exhibit impressive performance in comparison with task-specific models, which demonstrates our proposed model is an expert in handling skeleton sequence-based multi-tasking.

Qualitative Results. We visualize and compare the results of our SiC and the most recent SoTA model, MotionBERT, which is re-structured as an end-to-end multi-task model for fair comparison. As highlighted in Fig. 4, our SiC can generate more accurate poses than MotionBERT according to the provided task-guided prompt.

3 Generalization Capability

Generalize to New Datasets. To verify the generalization ability of our model, we conduct cross-dataset experiments on specific tasks. For instance, our proposed model learns motion prediction on the AMASS dataset under the default benchmark, and is then used to predict future motion sequences on 3DPW during testing. Similarly, we set up four experiments: motion prediction (AMASS $\rightarrow$ 3DPW), pose estimation (H3.6M $\rightarrow$ AMASS, 3DPW), and joint completion (3DPW $\rightarrow$ H3.6M). As shown in Tab. 2, our model generalizes the learned task well to other datasets, while task-specific models cannot bridge the large gap between datasets even though they are specifically trained on this single task. Additionally, benefiting from the dynamic pseudo pose, Dynamic-SiC exhibits stronger generalization ability than Static-SiC.

Generalize to New Task. Furthermore, we construct a prompt that has not appeared in the training set, which performs motion in-between. As Fig. 5 shows, our trained model can complete the missing skeletons according to the customized task-guided prompt. Our SiC generalizes well to a new task, which is unexplored in other works.

4 Ablation Study

Effectiveness of the Skeleton Prompt. We conduct ablation experiments on our proposed skeleton prompts. As shown in Tab. 3, the skeleton prompts solve the problem that arises when previous work is directly extended to skeleton sequences, namely over-fitting. Meanwhile, applying both TGP and TUP to the model achieves the best results. Our proposed skeleton prompts replace the random masking training strategy used in previous work. In particular, TUP creates a training environment that does not require the query target, which ensures that the model will not see the ground truth of the query input in advance.

Prompt Engineering. Since previous work has shown that the choice of prompt in in-context learning affects model performance, we conduct ablation experiments on prompt selection. As Tab. 4 shows, we select the prompt by feature similarity and by minimum MPJPE between the input skeleton sequence and the prompt, respectively. Experimental results show that the random selection method achieves the best results. Unlike 2D images and 3D point clouds, skeleton-based task data is highly similar, so different prompts play similar roles within the same task. Random selection, in turn, exposes the model to a wider range of the data and achieves better results.
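Two of the selection strategies compared here can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function names, the representation of the prompt pool as a list of `(input, target)` pairs, and the decision to compare the query input against each prompt input are assumptions. The feature-similarity variant is left out because it depends on the trained encoder.

```python
import math
import random


def mpjpe(a, b):
    """Mean Euclidean distance between corresponding joints of two sequences."""
    distances = [math.dist(ja, jb)
                 for fa, fb in zip(a, b)
                 for ja, jb in zip(fa, fb)]
    return sum(distances) / len(distances)


def select_random(prompts, rng):
    """Default strategy: pick any prompt pair of the same task."""
    return rng.choice(prompts)


def select_min_mpjpe(query_input, prompts):
    """Pick the prompt pair whose input is closest to the query input."""
    return min(prompts, key=lambda pair: mpjpe(query_input, pair[0]))


def constant_sequence(value, frames=2, joints=3):
    """Toy (frames, joints, 3) sequence filled with one value."""
    return [[[value, value, value] for _ in range(joints)] for _ in range(frames)]


# Each prompt is an (input, target) pair from the same task.
prompts = [(constant_sequence(v), constant_sequence(v + 1.0)) for v in (0.0, 5.0, 9.0)]
query_input = constant_sequence(4.0)

closest = select_min_mpjpe(query_input, prompts)
sampled = select_random(prompts, random.Random(0))
```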

Architecture. We conduct ablation experiments on the architecture of the model. As shown in Tab. 5, after exploring different hidden feature dimensions and transformer depths, we find that the model performs best when the feature dimension is $256$ and the depth is $5$ .

5 In-Context Prior Pose

We conduct an in-depth exploration of Dynamic-TUP. We extract the trained Dynamic-TUP weight parameters, then randomly initialize a $(F,J,3)$ tensor and use the trained encoder to lift it to the $C$-dimensional feature space. Finally, an optimizer is used to optimize the tensor and minimize the difference between its features and the trained Dynamic-TUP. The top of Fig. 6 shows the visualization of Static-TUP, while the bottom shows the Dynamic-TUP. The recovered skeleton of Dynamic-TUP is similar to the Static-TUP, i.e., the mean pose of the training set, which demonstrates that the Dynamic-TUP learned by our model is actually prior knowledge whose receptive field is all ground truth in the training set. This prior knowledge helps our model learn more smoothly on multiple tasks, and adding learnable modules helps our model generalize to other datasets or new tasks (Sec. 5.3).

Conclusion

We propose Skeleton-in-Context, designed to process multiple skeleton-based tasks simultaneously after a single training run. Specifically, we build a skeleton-based in-context benchmark covering a wide range of tasks. Moreover, we propose skeleton prompts composed of TGP and TUP, which solve the overfitting problem that arises when skeleton sequence data is trained under the framework commonly applied in previous 2D and 3D in-context models. Besides, we demonstrate that our model can generalize to different datasets and new tasks, such as motion completion. We hope it paves the way for further exploration of in-context learning on skeleton-based sequences.

Appendix A Experimental Details

We implement the proposed Skeleton-in-Context model with the number of layers $N$ = 5, number of attention heads $H$ = 8, and hidden feature dimension $C$ = 256. For each prompt input/target and query input/target, the sequence length is $F$ = 16. We implement Skeleton-in-Context with PyTorch. In the default setting, during both training and evaluation, for each query pair, we randomly select a prompt pair from the training set of the same task as the query pair. We use an AdamW optimizer with a linearly decaying learning rate, starting at 0.0002 and decreasing by 1% after every epoch. All tasks are unified into a one-off, end-to-end training process, without any task-specific designs. The training takes about 6 hours for 120 epochs on 4 NVIDIA GeForce RTX 4090 GPUs.
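The learning-rate schedule can be written out explicitly. The sketch below assumes a multiplicative reading of "decreasing by 1% after every epoch", i.e. a factor of 0.99 per epoch; this is an interpretation rather than something the text states outright. It is suggested by the fact that subtracting 1% of the initial value per epoch would reach zero at epoch 100, before the 120 training epochs are over. The function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=2e-4, decay=0.99):
    """Learning rate after `epoch` completed epochs, assuming 1% multiplicative decay."""
    return base_lr * decay ** epoch


# One value per epoch boundary, from the start of training to the end of epoch 120.
schedule = [learning_rate(epoch) for epoch in range(121)]
```

Under this assumption the rate at the end of training is roughly 30% of its initial value.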

A.2 Datasets and Metrics

H3.6M (Human3.6M) is used for the pose estimation and future pose estimation tasks. It is a large-scale dataset containing 3.6 million video frames covering 15 types of actions performed by 7 actors. Following previous works, we use subjects 1, 5, 6, 7, and 8 for training and subjects 9 and 11 for testing, and preprocess the poses accordingly. After preprocessing, each pose has 17 joints. We use Stacked Hourglass (SH) networks to extract the 2D skeletons from videos as input. For pose estimation, we expect the model to estimate the corresponding 3D skeletons. For future pose estimation, we expect the model to simultaneously predict and estimate the future 3D skeletons over a time range of 300ms, given the past 300ms of 2D skeletons. Using the same dataset for two tasks makes multi-tasking harder, as the model needs further context to correctly perceive and then accomplish the task. In Skeleton-in-Context, we use prompts to guide the model to learn in context.

AMASS is used for the motion prediction task. It integrates most of the existing marker-based Mocap datasets, which are parameterized with a unified representation. We follow common practice in human motion prediction and use AMASS-BMLrub for testing and the rest of the datasets for training. For motion prediction, we expect the model to predict the future 3D motion sequence given the past 3D motion sequence. The time ranges of the future and past motion sequences are both 400ms, which is 10 frames at a frame rate of 25 fps. As the model requires sequences of 16 frames, we pad each sequence by repeating the last pose 6 times.
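The frame arithmetic and the padding can be checked with a short sketch. The helper name `pad_with_last_pose` and the toy poses are assumptions made for illustration; only the numbers (400ms, 25 fps, 16 frames) come from the text.

```python
def pad_with_last_pose(sequence, target_len):
    """Extend a pose sequence to target_len frames by repeating its last pose."""
    if not sequence:
        raise ValueError("cannot pad an empty sequence")
    missing = target_len - len(sequence)
    if missing < 0:
        raise ValueError("sequence is longer than target_len")
    return list(sequence) + [sequence[-1]] * missing


duration_ms, fps, model_frames = 400, 25, 16
num_frames = duration_ms * fps // 1000  # 400 ms at 25 fps -> 10 frames

# Toy sequence: frame t holds a single joint at (t, t, t).
history = [[[float(t)] * 3] for t in range(num_frames)]
padded = pad_with_last_pose(history, model_frames)
```

The padded sequence has 16 frames, the first 10 of which are the original motion and the last 6 of which are copies of frame 10.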

3DPW (3D Pose in the Wild) is used for the joint completion task. It has more than 51k frames with 3D annotations for challenging indoor and outdoor activities. For joint completion, we construct 2 settings, where we respectively randomly mask 40% and 60% of all the joints. We expect the model to reconstruct the missing joints.

Metrics. For Motion Prediction (MP.), Joint Completion (JC.), and Future Pose Estimation (FPE.), we report Mean Per Joint Position Error (MPJPE, in mm). For Pose Estimation (PE.), we additionally report another indicator, Normalized Mean Per Joint Position Error (N-MPJPE).
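For reference, MPJPE is the Euclidean distance between predicted and ground-truth joint positions, averaged over all joints and frames. A minimal plain-Python sketch follows; the function name and the toy data are illustrative, and any alignment or normalization applied before the metric in the actual evaluation protocol is not shown.

```python
import math


def mpjpe(prediction, target):
    """Mean Per Joint Position Error over a (F, J, 3) sequence.

    Returns the average Euclidean distance between corresponding joints,
    in the same unit as the inputs (millimetres in the paper).
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(prediction, target):
        for pred_joint, target_joint in zip(pred_frame, target_frame):
            total += math.dist(pred_joint, target_joint)
            count += 1
    return total / count


F, J = 16, 17
target = [[[3.0, 4.0, 0.0] for _ in range(J)] for _ in range(F)]
prediction = [[[0.0, 0.0, 0.0] for _ in range(J)] for _ in range(F)]
error = mpjpe(prediction, target)  # every joint is off by a 3-4-5 triangle
```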

Appendix B Multi-Task Synergistic Training

In this section, we address the challenge brought by multi-task training and demonstrate the effectiveness of our model under multi-task learning of human motion representations. First, we analyze the effect of negative transfer and how it can limit multi-task training. Next, we evaluate the effectiveness of our proposed task-guided and task-unified prompts (TGPs and TUPs) in addressing this challenge and facilitating collaborative training between multiple tasks.

A straightforward way to train a generalist multi-task model is to collect the data from multiple tasks, format them into the same shape, and directly train the model on them. However, as explained in prior work, a phenomenon named negative transfer occurs in multi-dataset training and leads to poor results with this training method. As different tasks have their own unique goals, they may confuse the model during training without guidance from context, leading to a performance drop in testing. To demonstrate the effect of negative transfer, we train and test the backbone of our SiC separately on each of the four tasks; the results are shown in Tab. 6 with a gray background. Note that the backbone here does not include TUP and TGP. When we train the four tasks together, the performance of the model decreases due to the data variability between the individual tasks, which can be attributed to the negative transfer phenomenon. The experiments show that, without context guidance from prompts (TUP and TGP), multi-task training is limited by negative transfer and is not able to achieve satisfactory performance.

B.2 Multi-Task Synergistic Training

Our Skeleton-in-Context enables multiple tasks to be trained in a unified way, which we term Multi-Task Synergistic Training. With our proposed Task-Guided Prompt (TGP), SiC can perceive context (the corresponding task-specific information) from TGP and accomplish the task accurately. Meanwhile, data from different datasets and tasks follow different patterns and may hurt performance if not processed properly; to deal with this, we introduce the Task-Unified Prompt (TUP) as prior knowledge of the query target. As a task-agnostic module, the TUP is able to encode the unified human motion representations of various tasks. As Tab. 6 shows, with the proposed TGP and TUP, our SiC is able to achieve impressive results on four tasks simultaneously.

Appendix C Detailed Results and Visualization

We report the action-wise results in pose estimation and future pose estimation on H3.6M. As shown in Tab. 7 and Tab. 8, our model outperforms other multi-task models in each action on H3.6M, which verifies our model’s ability to handle multiple tasks. Additionally, Dynamic-SiC performs better than Static-SiC in most actions.

C.2 More Visualization

We provide more visualization results on four tasks in comparison with all the multi-task baselines that we mention in the main text, as shown in Figs. 7, 8, 9, and 10. The other models appear unable to balance the resources of each task during training, which leads to unsatisfactory results.