Tracking Objects as Pixel-wise Distributions

Zelin Zhao, Ze Wu, Yueqing Zhuang, Boxun Li, Jiaya Jia

Introduction

Multi-Object Tracking (MOT) is a long-standing and challenging problem in computer vision that aims to predict the trajectories of objects in a video. Prior work investigates tracking paradigms, optimizes association procedures, and learns motion models. Recently, with powerful transformers deployed in image classification and object detection, concurrent work has applied transformers to multi-object tracking. Despite the promising results, we note that the potential of the transformer still leaves much room to explore.

As shown in Figure 1, the current transformer-based MOT architectures MOTR and TransCenter still face challenges in detecting small objects and handling occlusions. Representing objects via pixel-wise distributions and conducting pixel-wise associations might alleviate these issues for the following reasons. First, pixel-wise information may help overcome occlusion based on low-level cues. Second, recent transformer architectures demonstrate strong performance in pixel-wise prediction. Third, pixel-wise prediction preserves more low-confidence details, which can improve tracking robustness.

We propose a transformer approach, called P3AFormer, to conduct pixel-wise propagation, prediction, and association. P3AFormer propagates information between frames via dense feature propagation, exploiting context information to resist occlusion. To produce a robust pixel-level distribution for each object, P3AFormer adopts a meta-architecture that generates object proposals and object-centric heatmaps. Masked attention is adopted to obtain localized tracking results and avoid background noise. Further, P3AFormer employs a multi-scale pixel-wise association procedure to recover object IDs robustly. Ablation studies demonstrate the effectiveness of each of these improvements.

Besides these pixel-wise techniques, we consider a few bells and whistles during the training of P3AFormer. First, we use Hungarian matching (rather than direct mapping) to enhance detection accuracy. Second, inspired by the empirical findings of YOLOX, we use strong data augmentation, namely mosaic augmentation, during training. Third, P3AFormer preserves all low-confidence predictions to ensure strong association.

We submit our results to the MOT17 test server and obtain 81.2% MOTA and 78.1% IDF1, outperforming all previous work. On the MOT20 benchmark, we report test accuracy of 78.1% MOTA and 76.4% IDF1, surpassing existing transformer-based work by a large margin. We further validate our approach on the KITTI dataset, where it outperforms state-of-the-art methods. Finally, we apply the pixel-wise techniques to other MOT frameworks and find that they generalize well to other paradigms.

Related Work

We first discuss concurrent transformer-based MOT approaches. TrackFormer applies the vanilla transformer to the MOT domain and progressively handles newly appeared tracks. TransTrack and MOTR take similar approaches to update the track queries from frame to frame. They explore a tracking-by-attention scheme, while we propose to track objects as pixel-wise distributions. TransCenter conducts association via center offset prediction and emphasizes a similar concept of dense query representation; however, its model design and tracking scheme differ from ours. TransMOT is a recent architecture that augments the transformer with spatial-temporal graphs, whereas P3AFormer does not use graph attention. We validate different paradigms and model components in experiments to support the design choices of our method.

2.2 Conventional multi-object tracking

The most widely used MOT framework is tracking-by-detection. DeepSORT leverages bounding box overlap and appearance features from a neural network to associate bounding boxes predicted by an off-the-shelf detector. Yang et al. propose a graph-based formulation to link detected objects. Other work tries different formulations. For example, Yu et al. formulate MOT as a decision-making problem in Markov Decision Processes (MDP). CenterTrack tracks objects through predicted center offsets. Joint detection and tracking has also been investigated to reduce post-processing overhead. Another line of work leverages graph neural networks to learn temporal object relations.

2.3 Transformer revolution

Transformer architectures have achieved great success in natural language processing (NLP). Recently, transformers have demonstrated strong performance in various vision tasks, such as image classification, object detection, segmentation, 3D recognition, and pose estimation. The seminal work DETR proposes a simple framework for end-to-end object detection. MaskFormer utilizes a meta-architecture to jointly generate pixel embeddings and object proposals via transformers. Previous transformers use masks in attention to restrict the attention region or force the computation to be local.

2.4 Video object detection

The tracking-by-detection paradigm requires accurate object detection and robust feature learning from videos. Zhu et al. propose dense feature propagation to aggregate features from nearby frames. Follow-up work improves aggregation and keyframe scheduling. The MEGA model combines messages from different frames on local and global scales. These methods do not consider video object detection with transformers. TransVOD proposes aggregating the transformer output queries from different frames via a temporal query encoder. TransVOD cannot be directly applied to our setting because it is not an online algorithm and does not make pixel-wise predictions.

2.5 Pixel-wise techniques

Pixel-wise techniques have proven effective in various computer vision applications. Dense fusion and pixel-wise voting networks have been proposed to overcome occlusions in object pose estimation. DPT uses a dense prediction transformer for monocular depth estimation and semantic segmentation. Pyramid vision transformer replaces convolutional neural networks with attention for dense prediction tasks. Yuan et al. present a high-resolution transformer for human pose estimation. Our P3AFormer, instead, explores the power of pixel-wise techniques in the MOT domain.

Pixel-wise Propagation, Prediction and Association

Different from tracking objects via bounding boxes or as points, we propose to track objects as pixel-wise distributions. Specifically, P3AFormer first extracts features from each single frame (Sec. 3.1), summarizes features from different frames via pixel-wise feature propagation (Sec. 3.2), and predicts pixel-wise object distributions via an object decoder (Sec. 3.3). The training targets are listed in Sec. 3.4. During inference, P3AFormer conducts pixel-wise association (Sec. 3.5) to build tracks from object distributions.

As shown on the top-left of Figure 2, P3AFormer uses a backbone to generate latent features and a pixel decoder to produce pixel-wise heatmaps. The details are as follows.

3.1.2 Pixel decoder.

P3AFormer uses a pixel decoder, which is a transformer decoder, to up-sample the features $\mathbf{F}^{(t)}$ and generate the per-pixel representation $\mathbf{P}^{(t)}_{l}$, where $l$ is the current decoding level. The pixel decoder is also applied to the previous-frame feature $\mathbf{F}^{(t-1)}$ to get the pixel-wise feature $\mathbf{P}^{(t-1)}_{l}$. In our work, we use a multi-scale deformable attention transformer as the pixel decoder.

3.2 Pixel-wise feature propagation

Extracting temporal context from nearby frames is very important in MOT. We use pixel-wise flow-guided feature propagation to summarize features between frames (shown in the middle-left of Figure 2). Formally, given a flow network $\Phi$, the flow guidance can be represented as $\Phi(\mathcal{I}^{(t-1)},\mathcal{I}^{(t)})$. After that, a bilinear warping function $\mathcal{W}$ transforms the previous-frame feature to align with the current feature as

$$\mathbf{P}_{l}^{(t-1)\to(t)}=\mathcal{W}\left(\mathbf{P}_{l}^{(t-1)},\,\Phi(\mathcal{I}^{(t-1)},\mathcal{I}^{(t)})\right).$$

The warped feature is then fused with the current feature using a pixel-wise weight, where the weight $w^{(t-1)\to(t)}$ is the pixel-wise cosine similarity between the warped feature $\mathbf{P}_{l}^{(t-1)\to(t)}$ and the reference feature $\mathbf{P}_{l}^{(t)}$. The pixel-wise cosine similarity function is provided in the supplementary file for reference. The shape of $\mathbf{P}_{l}$ is denoted as $H_{P_{l}}\times W_{P_{l}}\times d$.
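The propagation step can be sketched in dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are nested lists of shape $H\times W\times d$, the flow stores a per-pixel offset `(dx, dy)` telling each current-frame pixel where to read in the previous frame (an assumed convention), and the additive fusion in `propagate` is an assumed form, since the text only specifies that the weight is a pixel-wise cosine similarity. The function names are hypothetical.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample an H x W x d feature map at a real-valued location (x, y)."""
    h, w, d = len(feat), len(feat[0]), len(feat[0][0])
    # Clamp to the border so that out-of-range flow still reads a valid pixel.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * feat[y0][x0][c] + ax * feat[y0][x1][c])
        + ay * ((1 - ax) * feat[y1][x0][c] + ax * feat[y1][x1][c])
        for c in range(d)
    ]

def bilinear_warp(prev_feat, flow):
    """Warp the previous-frame feature onto the current frame.
    flow[y][x] = (dx, dy): current pixel (x, y) reads from (x + dx, y + dy)."""
    return [
        [bilinear_sample(prev_feat, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(len(flow[0]))]
        for y in range(len(flow))
    ]

def cosine_similarity(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)

def propagate(cur_feat, prev_feat, flow):
    """Fuse the current feature with the warped previous feature.
    The additive, similarity-weighted form is an assumption for illustration."""
    warped = bilinear_warp(prev_feat, flow)
    fused = []
    for y, row in enumerate(cur_feat):
        fused_row = []
        for x, cur in enumerate(row):
            w = cosine_similarity(warped[y][x], cur)
            fused_row.append([c + w * p for c, p in zip(cur, warped[y][x])])
        fused.append(fused_row)
    return fused
```

With this weighting, a warped feature that agrees with the current feature reinforces it, while a dissimilar one (for example, from an occluded or misaligned region) contributes little.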

3.3 Pixel-wise predictions

Since an image from an MOT dataset often involves a large number of small objects, local features around the objects are often more important than features at long range. Inspired by the recent discovery that masked attention can promote localized feature learning, P3AFormer uses masked attention in each layer of the object decoder. The mask matrix $\mathbf{M}_{l}$ is initialized as an all-zero matrix and is then determined by the center heatmaps at the previous level (presented in Eq. (5)). The standard masked attention can be denoted as

After getting the center heatmaps, the attention mask corresponding to the $i$-th object and position $(x,y)$ is updated as

Such a mask restricts the attention to the local region around the object center, which substantially benefits tracking performance (as shown in Sec. 4.4).
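As a rough illustration of how a heatmap-derived mask localizes attention, the sketch below implements additive masked attention for a single object query in plain Python. It follows the commonly used formulation in which the mask adds 0 to permitted positions and $-\infty$ to blocked ones before the softmax. The 0.5 threshold, the fallback to full attention, and the function names are assumptions made for this example; learned projections and residual connections are omitted.

```python
import math

NEG_INF = float("-inf")

def mask_from_heatmap(heatmap, threshold=0.5):
    """Additive mask per flattened pixel: 0.0 keeps the pixel, -inf blocks it.
    The threshold value is an assumption for illustration."""
    mask = [0.0 if v > threshold else NEG_INF for row in heatmap for v in row]
    # If no pixel passes the threshold, fall back to full attention
    # so that the softmax below stays well defined.
    if all(m == NEG_INF for m in mask):
        mask = [0.0] * len(mask)
    return mask

def masked_attention(query, keys, values, mask):
    """One object query attending over flattened pixel features."""
    logits = [m + sum(q * k for q, k in zip(query, key))
              for key, m in zip(keys, mask)]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # blocked pixels get 0.0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
```

Pixels outside the thresholded heatmap receive exactly zero attention weight, so background features cannot leak into the object query regardless of their similarity to it.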

3.4 Training targets

P3AFormer leverages bipartite Hungarian matching to match $N$ predicted objects to $K$ ground-truth objects. During classification, the unmatched object classes are set to an additional category called “no object” ($\varnothing$). Following MaskFormer, we adopt a pixel-wise metric (instead of bounding boxes) in the Hungarian matching.

First, we construct the ground-truth heatmap $h_{l}^{i}$ for an object via a Gaussian function whose center is at the object center and whose radius is proportional to the object size. Given the predicted center heatmaps $\hat{h}_{l}^{i}$ and class distribution $\hat{p}_{l}^{i}$ of the $i$-th object, and the corresponding ground-truth center heatmaps $h_{l}^{i}$ and object class $c_{l}^{i}$, we compute the association cost between the prediction and the ground truth via the pixel-wise cost function of

P3AFormer further computes three losses given the matching between predictions and ground-truth objects: (1) a cross-entropy loss between the predicted and ground-truth classes; (2) a focal loss between the predicted and ground-truth center heatmaps; (3) a size loss computed as the $L1$ loss between the predicted and ground-truth sizes. The final loss is a weighted sum of these three losses over all levels.
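The target construction and matching described in this section can be sketched as follows. This is a hedged illustration only: the `ratio` linking the Gaussian spread to the object size is a hypothetical value, `pixel_cost` uses an assumed form (negative class probability plus mean absolute heatmap error) rather than the paper's exact cost, and a brute-force search over permutations stands in for the Hungarian algorithm, returning the same optimum on small inputs.

```python
import math
from itertools import permutations

def gaussian_heatmap(height, width, cx, cy, obj_w, obj_h, ratio=0.25):
    """Ground-truth center heatmap: a Gaussian peaked at the object center.
    The spread is ratio * object size; the ratio value is hypothetical."""
    sx, sy = max(ratio * obj_w, 1e-6), max(ratio * obj_h, 1e-6)
    return [
        [math.exp(-((x - cx) ** 2) / (2 * sx ** 2)
                  - ((y - cy) ** 2) / (2 * sy ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def pixel_cost(pred_heatmap, pred_probs, gt_heatmap, gt_class):
    """Assumed pixel-wise cost: -p(class) + mean absolute heatmap error."""
    diffs = [abs(p - g)
             for pred_row, gt_row in zip(pred_heatmap, gt_heatmap)
             for p, g in zip(pred_row, gt_row)]
    return -pred_probs[gt_class] + sum(diffs) / len(diffs)

def match(preds, gts):
    """Assign K ground truths to N >= K predictions at minimum total cost.
    preds: list of (heatmap, class_probs); gts: list of (heatmap, class_id).
    Returns {prediction index: ground-truth index}; predictions missing from
    the result are the ones that would be labelled "no object"."""
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pixel_cost(preds[i][0], preds[i][1], gt[0], gt[1])
                   for i, gt in zip(perm, gts))
        if cost < best_cost:
            best, best_cost = perm, cost
    return {pred_idx: gt_idx for gt_idx, pred_idx in enumerate(best)}
```

Because the cost compares whole heatmaps rather than box coordinates, a prediction whose peak is slightly off-center is penalized smoothly instead of being rejected by a hard overlap test.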

3.5 Pixel-wise association

After representing objects as a pixel-wise distribution, P3AFormer adopts a pixel association procedure to recover object tracks across frames. This procedure is conducted from the first frame ( $t=0$ ) to the current frame ( $t=T$ ) in a frame-by-frame manner, which means P3AFormer is an online tracker.

We sketch the association procedure at the timestep $t=T$ in Figure 3. A track $\tau_{k}$ corresponds to an object with a unique ID $k$. We store in $\tau_{k}$ the bounding box $\tau_{k}$.bbox (recovered from the predicted center and size), the score $\tau_{k}$.score (the peak value in the center heatmap), the predicted class $\tau_{k}$.class, and the center heatmap $\tau_{k}$.heatmap.

In step A, we use the Kalman filter to predict the location of objects in the current frame ($t=T$) based on previous center locations ($t<T$). The heatmaps are translated along with the forecast movement of the object center via bilinear warping. Step B uses the Hungarian algorithm to match the pixel-wise predictions with the track forecasts. P3AFormer only matches objects under the same category and omits the “no-object” category. The association cost for Hungarian matching is the L1 distance between a track’s forecast heatmap and an object’s predicted heatmap. We accept a match if the association cost between the matched track and prediction is smaller than a threshold $\eta_{m}$. A new track $\tau_{k^{\prime}}$ is initialized for an untracked object $k^{\prime}$ if its confidence $\tau_{k^{\prime}}$.score is larger than $\eta_{s}$. In step C, dead tracks that have not been matched with any prediction for $n_{k}$ frames are killed. The hyper-parameters $\eta_{m}$, $\eta_{s}$, and $n_{k}$ are detailed in Sec. 4.2, and we provide more algorithm details in the supplementary file.
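Steps A to C can be sketched in standard-library Python as below. Several simplifications are swapped in and should be read as assumptions rather than the paper's procedure: a constant-velocity forecast replaces the Kalman filter, integer heatmap shifts replace bilinear warping, a brute-force optimal assignment replaces the Hungarian algorithm (same optimum, small inputs only), and the L1 cost is normalized by the total heatmap mass so that it lies in $[0,1]$. The `Track` fields and function names are hypothetical; detections are assumed to have the “no-object” category already filtered out.

```python
from dataclasses import dataclass
from itertools import permutations

ETA_M, ETA_S, N_K = 0.65, 0.80, 30  # threshold values reported in the paper
BIG = 1e6                           # cost that forbids cross-class matches

@dataclass
class Track:
    track_id: int
    cls: int
    centers: list   # history of integer (x, y) centers
    heatmap: list   # latest center heatmap, indexed [y][x]
    score: float
    misses: int = 0

def forecast_center(track):
    """Step A stand-in: constant velocity instead of a Kalman filter."""
    if len(track.centers) < 2:
        return track.centers[-1]
    (x0, y0), (x1, y1) = track.centers[-2], track.centers[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def shift_heatmap(heatmap, dx, dy):
    """Translate by integer (dx, dy), zero-filling uncovered pixels."""
    h, w = len(heatmap), len(heatmap[0])
    return [[heatmap[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w
             else 0.0 for x in range(w)] for y in range(h)]

def l1_cost(a, b, eps=1e-8):
    """L1 distance normalized by total mass (the normalization is assumed)."""
    diff = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return diff / (sum(map(sum, a)) + sum(map(sum, b)) + eps)

def optimal_assignment(cost):
    """Minimum-cost one-to-one assignment by brute force."""
    n = len(cost)
    m = len(cost[0]) if cost else 0
    if n == 0 or m == 0:
        return []
    if n <= m:
        best = min(permutations(range(m), n),
                   key=lambda p: sum(cost[r][c] for r, c in enumerate(p)))
        return list(enumerate(best))
    best = min(permutations(range(n), m),
               key=lambda p: sum(cost[r][c] for c, r in enumerate(p)))
    return [(r, c) for c, r in enumerate(best)]

def associate(tracks, detections, next_id):
    """One online step; detections are dicts with cls, center, heatmap, score."""
    cost = []
    for t in tracks:                                   # step A: forecast
        (fx, fy), (lx, ly) = forecast_center(t), t.centers[-1]
        pred = shift_heatmap(t.heatmap, fx - lx, fy - ly)
        cost.append([l1_cost(pred, d["heatmap"]) if d["cls"] == t.cls else BIG
                     for d in detections])
    hit_tracks, hit_dets = set(), set()
    for r, c in optimal_assignment(cost):              # step B: match
        if cost[r][c] < ETA_M:
            t, d = tracks[r], detections[c]
            t.centers.append(d["center"])
            t.heatmap, t.score, t.misses = d["heatmap"], d["score"], 0
            hit_tracks.add(r)
            hit_dets.add(c)
    for r, t in enumerate(tracks):
        if r not in hit_tracks:
            t.misses += 1
    for c, d in enumerate(detections):                 # start new tracks
        if c not in hit_dets and d["score"] > ETA_S:
            tracks.append(Track(next_id, d["cls"], [d["center"]],
                                d["heatmap"], d["score"]))
            next_id += 1
    tracks[:] = [t for t in tracks if t.misses < N_K]  # step C: kill
    return next_id
```

Calling `associate` once per frame, in order, keeps the procedure online: each call only needs the existing tracks and the current frame's predictions.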

Experiments

P3AFormer achieves better results than previous MOT approaches on three public benchmarks. We then ablate each component of P3AFormer to demonstrate the effectiveness of the pixel-wise techniques. After that, we generalize the proposed pixel-wise techniques to other MOT frameworks.

The MOT17 dataset focuses on multi-person tracking in crowded scenes. It has 14 video sequences in total, seven of which are for testing. MOT17 is the most popular dataset for benchmarking MOT approaches. Following previous work, we split the MOT17 dataset into two sets during validation. We use the first split for training and the second for validation. We denote this validation dataset as MOT17-val for convenience. The best model selected during validation is trained on the full MOT17 dataset and is submitted to the test server under the “private detection” setting. The main metrics are MOTA, IDF1, MT, ML, FP, FN, and IDSW, and we refer readers to prior work for details of these metrics. For MOT17, we add CrowdHuman, Cityperson, and ETHZ to the training sets following prior work. When training on a single image rather than a video (so that no neighboring frames are available), the P3AFormer model replicates the image and takes two identical images as input.
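For orientation, the two headline metrics can be computed from error counts as in the short sketch below. The formulas are the standard CLEAR-MOT and identity-metric definitions rather than anything specific to this paper; the function and argument names are illustrative.

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, where GT counts all
    ground-truth boxes over the whole sequence."""
    return 1.0 - (fn + fp + idsw) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), computed from
    identity-level true positives, false positives, and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

MOTA is dominated by detection errors (FN and FP), whereas IDF1 rewards keeping the same identity over time, which is why both are reported together.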

4.1.2 MOT20 [13].

The MOT20 dataset consists of eight new sequences in crowded scenes. We train on the MOT20 training split with the same hyper-parameters as the MOT17 dataset. We submit our inferred tracks to the public server of MOT20 under the “private detection” protocol. The evaluation metrics are the same as MOT17.

4.1.3 KITTI [23].

The KITTI tracking benchmark contains annotations for eight different classes, of which only two, “car” and “pedestrian”, are evaluated. The benchmark provides 21 training sequences and 29 test sequences. Following prior work, we split the training sequences into halves for training and validation. Besides the common metrics, KITTI uses the additional HOTA metric to balance the effect of detection and association.
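For readers unfamiliar with HOTA, its standard definition (taken from the metric's own formulation, not restated in this paper) combines a detection accuracy term and an association accuracy term at each localization threshold $\alpha$ and then averages over thresholds:

```latex
\mathrm{HOTA}_{\alpha} = \sqrt{\mathrm{DetA}_{\alpha}\cdot\mathrm{AssA}_{\alpha}},
\qquad
\mathrm{HOTA} = \frac{1}{19}\sum_{\alpha\in\{0.05,\,0.10,\,\dots,\,0.95\}} \mathrm{HOTA}_{\alpha}.
```

The geometric mean means that a tracker cannot score well by excelling at detection alone or at association alone.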

4.2 Implementation details

We mainly use ResNet and Swin-Transformer as the backbones in P3AFormer. For ResNet, we use the ResNet-50 configuration. For Swin-Transformer, we use the Swin-B backbone. We use Swin-B in all final models submitted to the leaderboards and ResNet-50 for validation experiments. The hidden feature dimension is $d=128$.

We adopt the deformable DETR decoder as the multi-scale pixel-wise decoder. Specifically, we use six deformable attention layers to generate feature maps, and the resolutions are the same as in Mask2Former. We have $L=3$ layers of feature maps in total. We add sinusoidal positional embeddings and learnable scale-level embeddings to the feature maps following prior work.

During feature propagation, we use the simple version of FlowNet pre-trained on the Flying Chairs dataset. The generated flow field is scaled to match the resolution of feature maps with bilinear interpolation.

The object decoder also has $L=3$ layers. We adopt $N=100$ queries, which are initialized as all-zeros and are learnable embeddings during training. No dropout is adopted since it would deteriorate the performance of the meta-architecture.

The thresholds in the pixel-wise association are $\eta_{m}=0.65$ and $\eta_{s}=0.80$ on all benchmarks. We found that the P3AFormer model is robust under a wide range of thresholds (see supplementary). Lost tracks are deleted if they do not reappear within $n_{k}=30$ frames.

The input image is of shape $1440\times 800$ for MOT17/MOT20 and $1280\times 384$ for KITTI. Following prior work, we use data augmentation such as Mosaic and Mixup. We use AdamW with an initial learning rate of $6\times 10^{-5}$ and a weight decay of $1\times 10^{-4}$, together with the poly learning rate schedule. The full training procedure lasts for 200 epochs. The P3AFormer models are all trained with eight Tesla V100 GPUs. The specific configurations of the losses and the run-time analysis of different models are provided in the supplementary.
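The poly learning rate schedule mentioned above can be written in one line. The sketch below uses the reported initial rate of $6\times 10^{-5}$; the exponent `power=0.9` is a common default and an assumption here, since the text does not state it, and the function name is illustrative.

```python
def poly_lr(step, total_steps, base_lr=6e-5, power=0.9):
    """Polynomial decay from base_lr at step 0 to zero at total_steps.
    power=0.9 is an assumed default, not a value given in the paper."""
    return base_lr * (1.0 - step / total_steps) ** power
```

For example, `poly_lr(0, 1000)` returns the initial rate and the value decreases monotonically until it reaches zero at the final step.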

4.3 Comparisons on public benchmarks

We first compare the P3AFormer model with several baselines on the MOT17 test sets, and the results are presented in Table 1. With bells and whistles, P3AFormer outperforms all previous approaches on the two major metrics of MOTA and IDF1. Besides, P3AFormer surpasses the concurrent unpublished transformer-based approaches by a large margin (4.5% MOTA and 3.0% IDF1). P3AFormer also outperforms the strong unpublished baseline ByteTrack, although our model differs from theirs. Further, our association procedure does not involve additional training parameters, unlike those of some previous methods.

We also report results on the MOT20 test server in Table 2. Again, P3AFormer demonstrates superior performance with bells and whistles. It outperforms state-of-the-art methods, including unpublished work. Besides, P3AFormer outperforms the concurrent transformer-based work by a large margin (13.1% MOTA and 17.0% IDF1). It achieves the best results on this leaderboard.

A comparison between P3AFormer and the baselines on the KITTI dataset is given in Table 3. Our work outperforms all baselines on both object classes. Notably, P3AFormer surpasses the strong baseline PermaTrack, which leverages additional synthetic training data to overcome occlusions, even though P3AFormer does not need such additional training data.

4.4 Effectiveness of pixel-wise techniques

We decouple the P3AFormer’s pixel-wise techniques and validate the contribution of each part. We use “Pro.” to denote the pixel-wise feature propagation, “Pre.” to denote pixel-wise prediction, and “Ass.” to denote the pixel-wise association. The details of the ablated models are in the supplementary file.

The results are presented in Table 6. We also report the detection mean average precision (mAP). The results are much worse when all pixel-wise techniques are removed (the first row of Table 6). Compared to the last row, this incomplete system is lower by 9.2% mAP, 10.1% MOTA, and 9.2% IDF1.

When we remove the pixel-wise propagation or pixel-wise prediction (2nd and 3rd rows of Table 6), the results are worse in terms of the detection mAP. Finally, we try different combinations of pixel-wise techniques (4th and 5th rows of Table 6). These combinations improve the tracking performance.

4.5 Influence of training techniques

P3AFormer also incorporates several techniques for training transformers. The results are presented in Table 6. We use “Mask.” to represent masked attention, which is beneficial to detection (0.3 mAP) and association (1.3% IDF1). We then verify the effect of mixing datasets (CrowdHuman, Cityperson, and ETHZ) by comparison with using only the MOT17 dataset (denoted as “w/o Mix.” in Table 6). Using external datasets clearly improves detection and tracking performance. Besides, we notice that using Mosaic augmentation (4th row of Table 6), using learnable queries (5th row of Table 6), and connecting all bounding boxes (6th row of Table 6) all slightly improve P3AFormer.

4.6 Generalizing to other trackers

Although our pixel-wise techniques are implemented on the transformer structure, they can be applied to other trackers. We consider the tracking-by-detection tracker Tracktor, which is based on Faster R-CNN with a camera motion model and a re-ID network.

First, we apply pixel-wise feature propagation to Tracktor. Second, we change the output shape of Faster R-CNN to predict pixel-wise information. After that, we remove the association procedure of Tracktor and replace it with our dense association scheme. More details of this generalization experiment are included in the supplementary file. The results of the above models are presented in Table 6. Tracking objects as pixel-wise distributions clearly also improves CNN-based frameworks.

4.7 Visualization of results

Visualization of tracking results in comparison to several transformer-based approaches is provided in Figure 1. The P3AFormer can robustly track small objects without many ID switches. Besides, we provide the visualization of center heatmaps and tracking results of P3AFormer in Figure 4. Even when the objects are heavily occluded, the predicted pixel-wise distribution can provide useful clues to recover the relationship between objects.

Conclusion

In this paper, we have presented the P3AFormer, which tracks objects as pixel-wise distributions. First, P3AFormer adopts dense feature propagation to build connections through frames. Second, P3AFormer generates multi-level heatmaps to preserve detailed information from the input images. Finally, the P3AFormer exploits pixel-wise predictions during the association procedure, making the association more robust.

P3AFormer obtains state-of-the-art results on three public benchmarks: MOT17, MOT20, and KITTI. P3AFormer demonstrates strong robustness against occlusion and outperforms concurrent transformer-based frameworks significantly. The proposed pixel-wise techniques can be applied to other trackers and may motivate broader exploration of transformers in future work. We will also study the transformer architecture and make it more efficient to satisfy real-time tracking requirements.