MixFormer: End-to-End Tracking with Iterative Mixed Attention
Yutao Cui, Cheng Jiang, Limin Wang, Gangshan Wu
Introduction
Visual object tracking has been a fundamental task in computer vision for decades, aiming to estimate the state of an arbitrary target in video sequences given its initial status. It has been successfully deployed in various applications such as human-computer interaction and visual surveillance. However, designing a simple yet effective end-to-end tracker remains challenging in real-world scenarios. The main challenges arise from scale variations, object deformations, occlusion, and confusion from similar objects.
Current prevailing trackers typically have a multi-stage pipeline as shown in Fig. 1. It contains several components to accomplish the tracking task: (1) a backbone to extract generic features of the tracking target and search area, (2) an integration module to allow information communication between the tracking target and search area for subsequent target-aware localization, and (3) task-specific heads to precisely localize the target and estimate its bounding box. The integration module is the key component of tracking algorithms, as it is responsible for incorporating the target information to bridge the steps of generic feature extraction and target-aware localization. Traditional integration methods include correlation-based operations (e.g., SiamFC, SiamRPN, CRPN, SiamFC++, SiamBAN, OCEAN) and online learning algorithms (e.g., DCF, KCF, CSR-DCF, ATOM, DiMP, FCOT). Recently, thanks to their global and dynamic modeling capacity, transformers have been introduced to perform attention-based integration and yield good tracking performance (e.g., TransT, TMT, STMTrack, TREG, STARK, DTT). However, these transformer-based trackers still depend on a CNN for generic feature extraction, and only apply attention operations in the later high-level and abstract representation space. We argue that these CNN representations are limited, as they are typically pre-trained for generic object recognition and might neglect finer structural information for tracking. In addition, these CNN representations employ local convolutional kernels and lack global modeling power. Therefore, the CNN representation is still their bottleneck, which prevents them from fully unleashing the power of self-attention for the whole tracking pipeline.
To overcome the above issue, we present a new perspective on tracking framework design: generic feature extraction and target information integration should be coupled together within a unified framework. This coupled processing paradigm has several key advantages. First, it enables our feature extraction to be more specific to the corresponding tracking target and to capture more target-specific discriminative features. Second, it allows the target information to be more extensively integrated into the search area, and thereby better captures their correlation. In addition, it results in a more compact and neat tracking pipeline with only a single backbone and tracking head, without an explicit integration module.
Based on the above analysis, in this paper we introduce MixFormer, a simple tracking framework designed to unify feature extraction and target integration solely with a transformer-based architecture. The attention module is a very flexible architectural building block with dynamic and global modeling capacity, which makes few assumptions about the data structure and can be applied to general relation modeling. Our core idea is to exploit this flexibility of the attention operation and present a mixed attention module (MAM) that performs both feature extraction and mutual interaction of target template and search area at the same time. In particular, in our MAM, we devise a hybrid interaction scheme with both self-attention and cross-attention operations on the tokens from the target template and search area. The self-attention is responsible for extracting the features of the target or search area themselves, while the cross-attention allows communication between them to mix the target and search area information. To reduce the computational cost of MAM and thereby allow multiple templates to handle object deformation, we further present a customized asymmetric attention scheme by pruning the unnecessary target-to-search area cross-attention.
Following the successful transformer architecture in image recognition, we build our MixFormer backbone by stacking layers of Patch Embedding and MAM, and finally place a simple localization head on top to yield our whole tracking framework. As is common practice for dealing with object deformation during tracking, we also propose a score-based target template update mechanism, and our MixFormer can be easily adapted to multiple target template inputs. Extensive experiments on several benchmarks demonstrate that MixFormer sets a new state-of-the-art performance, with a real-time running speed of 25 FPS on a GTX 1080Ti GPU. In particular, MixFormer-L surpasses STARK by 5.0% (EAO score) on VOT2020, 2.9% (NP score) on LaSOT and 2.0% (NP score) on TrackingNet.
The main contributions are summarized as follows:
We propose a compact end-to-end tracking framework, termed as MixFormer, based on iterative Mixed Attention Modules (MAM). It allows for extracting target-specific discriminative features and extensive communication between target and search simultaneously.
For online template update, we devise a customized asymmetric attention in MAM for high efficiency, and propose an effective score prediction module to select high-quality templates, leading to an efficient and effective online transformer-based tracker.
The proposed MixFormer sets a new state-of-the-art performance on five challenging benchmarks, including VOT2020, LaSOT, TrackingNet, GOT-10k, and UAV123.
Related Work
Current prevailing tracking methods can be summarized as a three-part architecture, containing (i) a backbone to extract generic features, (ii) an integration module to fuse the target and search region information, and (iii) heads to produce the target states. Generally, most trackers use ResNet as the backbone. For the most important integration module, researchers have explored various methods. Siamese-based trackers combined a correlation operation with the Siamese network, modeling the global dependencies between the target and search. Some online trackers learned a target-dependent model for discriminative tracking. Furthermore, some recent trackers introduced a transformer-based integration module to capture more complicated dependencies and achieved impressive performance. Instead, we propose a fully end-to-end transformer tracker, solely containing a MAM-based backbone and a simple head, leading to a more accurate tracker with a neat and compact architecture.
Vision Transformer.
The Vision Transformer (ViT) first presented a pure vision transformer architecture, obtaining impressive performance on image classification. Some works introduced design changes to better model local context in vision Transformers. For example, PVT incorporated a multi-stage design (without convolutions) for Transformers, similar to multi-scale features in CNNs. CVT combined CNNs and Transformers to model both local and global dependencies for image classification in an efficient way. Our MixFormer uses the pre-trained CVT models, but there are some fundamental differences. (i) The proposed MAM performs dual attention for both feature extraction and information integration, while CVT uses self-attention solely to extract features. (ii) The learning tasks are different, and the corresponding inputs and heads are different. We use multiple templates together with the search region as input and employ a corner-based or query-based localization head for bounding box generation, while CVT is designed for image classification. (iii) We further introduce an asymmetric mixed attention and a score prediction module for the specific task of online tracking.
The attention mechanism has also been explored in object tracking recently. CGCAD and SiamAttn introduced correlation-guided attention and self-attention to perform discriminative tracking. TransT designed a transformer-based fusion network for target-search information incorporation. These methods still relied on post-processing for box generation. Inspired by DETR, STARK further proposed an end-to-end transformer-based tracker. However, it still followed the paradigm of Backbone-Integration-Head, with separate feature extraction and information integration modules. Meanwhile, TREG proposed a target-aware transformer for the regression branch and can generate accurate predictions on VOT2021. Inspired by TREG, we formulate the mixed attention mechanism by using both self-attention and cross-attention. In this way, our MixFormer unifies the two processes of feature extraction and information integration with an iterative MAM-based backbone, leading to a more compact, neat and effective end-to-end tracker.
Method
In this section, we present our end-to-end tracking framework, termed as MixFormer, based on iterative mixed attention modules (MAM). First, we introduce our proposed MAM to unify the process of feature extraction and target information incorporation. This simultaneous processing scheme will enable our feature extraction to be more specific to the corresponding tracking target. In addition, it also allows the target information integration to be performed more extensively and thus to better capture the correlation between target and search area. Then, we present the whole tracking framework of MixFormer, which only includes a MAM-based backbone and localization head. Finally, we describe the training and inference of MixFormer by devising a confidence score based target template update mechanism to handle object deformation in tracking procedure.
The mixed attention module (MAM) is the core design for pursuing a neat and compact end-to-end tracker. The input to our MAM is the target template and search area. It aims to simultaneously extract their own long-range features and fuse the interaction information between them. In contrast to the original Multi-Head Attention, MAM performs dual attention operations on two separate token sequences of target template and search area. It carries out self-attention on the tokens within each sequence to capture the target- or search-specific information. Meanwhile, it conducts cross-attention between tokens from the two sequences to allow communication between target template and search area. As shown in Fig. 2, this mixed attention mechanism can be implemented efficiently via a concatenated token sequence.
Formally, given a concatenated token sequence of multiple targets and search, we first split it into two parts and reshape them to 2D feature maps. To additionally model local spatial context, a separable depth-wise convolutional projection layer is applied to each feature map (i.e., query, key and value). It also provides efficiency benefits by allowing down-sampling of the key and value matrices. Then each feature map of target and search is flattened and processed by a linear projection to produce the queries, keys and values of the attention operation. We use $q_t$, $k_t$ and $v_t$ to represent the target, and $q_s$, $k_s$ and $v_s$ to represent the search region. The mixed attention is defined as:
$$
\begin{aligned}
k_m &= \mathrm{Concat}(k_t, k_s), \quad v_m = \mathrm{Concat}(v_t, v_s), \\
\mathrm{Atten}_t &= \mathrm{Softmax}\left(\frac{q_t k_m^{T}}{\sqrt{d}}\right) v_m, \\
\mathrm{Atten}_s &= \mathrm{Softmax}\left(\frac{q_s k_m^{T}}{\sqrt{d}}\right) v_m,
\end{aligned}
$$
where $d$ represents the dimension of the key, and $\mathrm{Atten}_t$ and $\mathrm{Atten}_s$ are the attention maps of the target and search respectively. The operation contains both self-attention and cross-attention, which unifies feature extraction and information integration. Finally, the target tokens and search tokens are concatenated and processed by a linear projection.
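As a rough illustration of the mixed attention described above, the following single-head sketch operates on plain Python lists of token vectors. It is not the authors' implementation: the helper names (`attend`, `mixed_attention`) are hypothetical, and the depth-wise convolutional projection, multi-head splitting and final linear projection are omitted. It only shows that both token groups attend to the concatenated keys and values of target and search.

```python
import math


def attend(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    d = len(k[0])  # key dimension
    out = []
    for qi in q:
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        peak = max(logits)  # subtract the max for numerical stability
        w = [math.exp(x - peak) for x in logits]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def mixed_attention(q_t, k_t, v_t, q_s, k_s, v_s):
    """Target and search queries both attend to the concatenated keys/values."""
    k_m = k_t + k_s  # concatenation along the token axis
    v_m = v_t + v_s
    return attend(q_t, k_m, v_m), attend(q_s, k_m, v_m)


# Toy example: 2 template tokens and 3 search tokens of dimension 2.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
out_t, out_s = mixed_attention(template, template, template,
                               search, search, search)
print(len(out_t), len(out_s))
```

Because every query sees the same concatenated keys and values, the two outputs together equal ordinary attention applied to the concatenated token sequence, which is why a single concatenated sequence suffices in practice.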
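Asymmetric mixed attention scheme. Intuitively, the cross-attention from the target query to the search area is not so important and might have a negative influence due to potential distractors. To reduce the computational cost of MAM and thereby allow multiple templates to be used efficiently to deal with object deformation, we further present a customized asymmetric mixed attention scheme by pruning the unnecessary target-to-search area cross-attention. This asymmetric mixed attention is defined as follows:
$$
\begin{aligned}
\mathrm{Atten}_t &= \mathrm{Softmax}\left(\frac{q_t k_t^{T}}{\sqrt{d}}\right) v_t, \\
\mathrm{Atten}_s &= \mathrm{Softmax}\left(\frac{q_s k_m^{T}}{\sqrt{d}}\right) v_m,
\end{aligned}
$$
where the subscripts $t$ and $s$ denote target and search tokens, $k_m = \mathrm{Concat}(k_t, k_s)$ and $v_m = \mathrm{Concat}(v_t, v_s)$ are the concatenated keys and values, and $d$ is the key dimension.
In this manner, the template tokens in each MAM remain unchanged during the tracking process, since they are not influenced by the dynamic search regions.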
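The effect of pruning the target-to-search cross-attention can be sketched as follows. This is a minimal single-head illustration with hypothetical helper names (`attend`, `asymmetric_mixed_attention`), not the authors' code; projections and multi-head handling are left out. The example checks the property stated above: the template output does not depend on the search tokens.

```python
import math


def attend(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    d = len(k[0])
    out = []
    for qi in q:
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        peak = max(logits)
        w = [math.exp(x - peak) for x in logits]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def asymmetric_mixed_attention(q_t, k_t, v_t, q_s, k_s, v_s):
    """Template uses self-attention only; search attends to template + search."""
    out_t = attend(q_t, k_t, v_t)
    out_s = attend(q_s, k_t + k_s, v_t + v_s)
    return out_t, out_s


template = [[1.0, 0.0], [0.0, 1.0]]
search_a = [[1.0, 1.0]]
search_b = [[-3.0, 2.0], [0.5, 0.5]]

t_a, _ = asymmetric_mixed_attention(template, template, template,
                                    search_a, search_a, search_a)
t_b, _ = asymmetric_mixed_attention(template, template, template,
                                    search_b, search_b, search_b)
print(t_a == t_b)  # template features are identical for both search regions
```

Since the template branch never reads the search tokens, its features could in principle be computed once per template and reused across frames; whether and how the released code caches them is not stated in the text.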
To better explain the insight behind the mixed attention, we compare it with the attention mechanisms used by other transformer trackers. Unlike our mixed attention, TransT uses ego-context augment and cross-feature augment modules to perform self-attention and cross-attention progressively in two steps. Compared to the transformer encoder of STARK, our MAM shares a similar attention mechanism but with three notable differences. First, we incorporate the spatial structure information with a depth-wise convolution, while they use positional encoding. More importantly, our MAM is built as a multi-stage backbone for both feature extraction and information integration, while they depend on a separate CNN backbone for feature extraction and only focus on information integration in a single stage. Finally, we also propose a different asymmetric MAM to further improve the tracking efficiency without much accuracy drop.
3.2 MixFormer for Tracking
Based on the MAM blocks, we build MixFormer, a compact end-to-end tracking framework. The main idea of MixFormer is to progressively extract coupled features for the target template and search area, and to deeply integrate the information between them. It comprises two components: a backbone composed of iterative target-search MAMs, and a simple localization head to produce the target bounding box. Compared with prevailing trackers that decouple the steps of feature extraction and information integration, it leads to a more compact and neat tracking pipeline with only a single backbone and tracking head, without an explicit integration module or any post-processing. The overall architecture is depicted in Fig. 3.
MAM Based Backbone.
Our goal is to couple generic feature extraction and target information integration within a unified transformer-based architecture. The MAM-based backbone employs a progressive multi-stage architecture design. Each stage is defined by a set of MAM and MLP layers operating on feature maps of the same scale with an identical number of channels. All stages share a similar architecture, which consists of an overlapped patch embedding layer and target-search mixed attention modules (i.e., a combination of MAM and MLP layers in implementation).
Specifically, given $T$ templates (i.e., the first template and $T-1$ online templates) with the size of $T \times H_t \times W_t \times 3$ and a search region (a cropped region according to the previous target states) with the size of $H_s \times W_s \times 3$, we first map them into overlapped patch embeddings using a convolutional Token Embedding layer with stride 4 and kernel size 7. The convolutional token embedding layer is introduced in each stage to increase the channel dimension while reducing the spatial resolution. Then we flatten the patch embeddings and concatenate them, yielding a fused token sequence with the size of $(T \times \frac{H_t}{4} \times \frac{W_t}{4} + \frac{H_s}{4} \times \frac{W_s}{4}) \times C$, where $C$ equals 64 or 192, $H_t$ and $W_t$ are 128, and $H_s$ and $W_s$ are 320 in this work. After that, the concatenated tokens pass through the target-search MAM to perform both feature extraction and target information incorporation. Finally, we obtain the token sequence of size $(T \times \frac{H_t}{16} \times \frac{W_t}{16} + \frac{H_s}{16} \times \frac{W_s}{16}) \times 6C$. More details about the MAM backbones can be found in Section 4.1 and Table 2. Before being passed to the prediction head, the search tokens are split and reshaped to the size of $\frac{H_s}{16} \times \frac{W_s}{16} \times 6C$. Notably, we do not apply the multi-scale feature aggregation strategy commonly used in other trackers (e.g., SiamRPN++, STARK).
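To make the size of the fused token sequence concrete, the short calculation below counts tokens for 128x128 templates and a 320x320 search region. The overall downsampling factors of 4 (after the first token embedding) and 16 (after the last stage) are assumptions taken from the CVT-style design, and `fused_token_count` is a hypothetical helper name.

```python
def fused_token_count(num_templates, template_size, search_size, stride):
    """Number of tokens after flattening and concatenating templates and search."""
    t_side = template_size // stride  # template feature-map side length
    s_side = search_size // stride    # search feature-map side length
    return num_templates * t_side * t_side + s_side * s_side


# One static template plus one online template, assumed strides 4 and 16.
first_stage = fused_token_count(2, 128, 320, stride=4)
last_stage = fused_token_count(2, 128, 320, stride=16)
print(first_stage, last_stage)  # 8448 528
```

Each extra template adds 1024 tokens in the first stage under these assumptions, which illustrates why the cost of attending from template tokens matters when several templates are used.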
Corner Based Localization Head.
Inspired by the corner detection head in STARK, we employ a fully convolutional corner-based localization head to directly estimate the bounding box of the tracked object, solely with several Conv-BN-ReLU layers for the top-left and the bottom-right corner prediction respectively. Finally, we obtain the bounding box by computing the expectation over the corner probability distributions. The difference from STARK is that ours is a fully convolutional head, while STARK relies heavily on both the encoder and the decoder with a more complicated design.
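The step of taking the expectation over a corner probability distribution can be sketched as a soft-argmax. The code below is an illustrative assumption rather than the authors' head: it takes raw score maps as nested lists, applies a softmax over all positions, and returns coordinates in feature-map cell units (any normalization to image coordinates is omitted). The function names are hypothetical.

```python
import math


def soft_argmax(score_map):
    """Expected (x, y) position under a softmax over all cells of a 2D score map."""
    peak = max(s for row in score_map for s in row)
    probs = [[math.exp(s - peak) for s in row] for row in score_map]
    total = sum(sum(row) for row in probs)
    exp_x = sum(p * x for row in probs for x, p in enumerate(row)) / total
    exp_y = sum(p * y for y, row in enumerate(probs) for p in row) / total
    return exp_x, exp_y


def box_from_corner_maps(top_left_map, bottom_right_map):
    """Bounding box (x1, y1, x2, y2) from the two corner score maps."""
    x1, y1 = soft_argmax(top_left_map)
    x2, y2 = soft_argmax(bottom_right_map)
    return x1, y1, x2, y2


# Toy 3x4 maps with sharp peaks at (x=0, y=0) and (x=3, y=2).
tl = [[50.0, 0.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4]
br = [[0.0] * 4, [0.0] * 4, [0.0, 0.0, 0.0, 50.0]]
print(box_from_corner_maps(tl, br))
```

Unlike a hard argmax, the expectation is differentiable with respect to the scores, which is what allows a head of this kind to be trained end-to-end with box regression losses.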
Query Based Localization Head.
Inspired by DETR, we propose to employ a simple query-based localization head. This sparse localization head can verify the generalization ability of our MAM backbone and yields a pure transformer-based tracking framework. Specifically, we add an extra learnable regression token to the sequence of the final stage and use this token as an anchor to aggregate information from the entire target and search area. Finally, an FFN of three fully connected layers is employed to directly regress the bounding box coordinates. This framework does not use any post-processing technique either.
3.3 Training and Inference
Training. The training process of our MixFormer generally follows the standard training recipe of current trackers. We first pre-train our MAM with a CVT model, and then fine-tune the whole tracking framework on the target dataset. Specifically, a combination of $L_1$ loss and GIoU loss is employed as follows:
$$
L_{loc} = \lambda_{L1} L_1(B_i, \hat{B}_i) + \lambda_{giou} L_{giou}(B_i, \hat{B}_i),
$$
where $\lambda_{L1}$ and $\lambda_{giou}$ are the weights of the two losses, $B_i$ is the ground-truth bounding box and $\hat{B}_i$ is the predicted bounding box of the targets.
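A small sketch of this localization loss for a single pair of boxes is given below. It assumes boxes in `(x1, y1, x2, y2)` format, averages the L1 term over the four coordinates, and uses the weights 2.0 (GIoU) and 5.0 (L1) reported in the training details of the appendix; the function names are hypothetical and batching is omitted.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def localization_loss(pred, target, w_giou=2.0, w_l1=5.0):
    """Weighted sum of GIoU loss (1 - GIoU) and mean absolute coordinate error."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4.0
    return w_giou * (1.0 - giou(pred, target)) + w_l1 * l1


print(localization_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))  # 0.0
print(localization_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

The GIoU term still provides a useful signal when the boxes do not overlap, since the enclosing-box penalty grows with their separation, while the L1 term directly penalizes coordinate error.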
Template Online Update. Online templates play an important role in capturing temporal information and dealing with object deformation and appearance variations. However, it is well recognized that poor-quality templates may lead to inferior tracking performance. Consequently, we introduce a score prediction module (SPM), illustrated in Fig. 4, to select reliable online templates according to the predicted confidence score. The SPM is composed of two attention blocks and a three-layer perceptron. First, a learnable score token serves as a query to attend to the search ROI tokens. This enables the score token to encode the mined target information. Next, the score token attends to all positions of the initial target token to implicitly compare the mined target with the first target. Finally, the score is produced by the MLP layer and a sigmoid activation. The online template is treated as negative when its predicted score is below 0.5.
The SPM is trained after the backbone, and we use a standard cross-entropy loss:
$$
L_{score} = -\left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right),
$$
where $y_i$ is the ground-truth label and $p_i$ is the predicted confidence score.
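For reference, the standard binary cross-entropy on a sigmoid score can be written as the sketch below. The clamping constant `eps` and the function names are assumptions added for numerical safety and illustration; they are not taken from the text.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def score_loss(p, y, eps=1e-12):
    """Binary cross-entropy between confidence p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


confidence = sigmoid(2.0)
print(score_loss(confidence, 1), score_loss(confidence, 0))
print(confidence >= 0.5)  # scores below 0.5 are treated as negative templates
```

A confident correct prediction gives a loss close to zero, while a confident wrong prediction is penalized heavily, which pushes the score toward a usable reliability estimate for template selection.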
Inference. During inference, multiple templates, including one static template and dynamic online templates, together with the cropped search region are fed into MixFormer to produce the target bounding box and the confidence score. We update the online templates only when the update interval is reached and select the sample with the highest confidence score.
Experiments
Our trackers are implemented using Python 3.6 and PyTorch 1.7.1. The MixFormer training is conducted on 8 Tesla V100 GPUs. Notably, MixFormer is a neat tracker without post-processing, positional embedding, or a multi-layer feature aggregation strategy.
As shown in Table 2, we instantiate two models, MixFormer and MixFormer-L, with different parameters and FLOPs by varying the number of MAM blocks and the hidden feature dimension in each stage. The backbones of MixFormer and MixFormer-L are initialized with CVT-21 and CVT24-W (the first 16 layers are employed) pretrained on ImageNet, respectively.
Training.
The training set includes the TrackingNet, LaSOT, GOT-10k and COCO training datasets, the same as for DiMP and STARK. For the GOT-10k test, we train our tracker using only the GOT10k train split, following its standard protocol. The whole training process of MixFormer consists of two stages: the first 500 epochs for the backbone and heads, and an extra 40 epochs for the score prediction head. We train MixFormer using ADAM with weight decay $10^{-4}$. The learning rate is initialized as 1e-4 and decreased to 1e-5 at epoch 400. The sizes of search images and templates are $320 \times 320$ pixels and $128 \times 128$ pixels respectively. For data augmentation, we use horizontal flip and brightness jittering.
Inference.
We use the first template and multiple online templates together with the current search region as input of MixFormer. The dynamic templates are updated when the update interval of 200 is reached by default. The template with the highest predicted score in the interval is selected to substitute the previous one.
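The update rule described above can be sketched as a small bookkeeping class. This is an assumed reading of the procedure, not the released code: it keeps the highest-scoring candidate seen within the current interval, ignores candidates scored below 0.5, and returns the chosen crop (or `None`) when the interval of 200 frames is reached. The class and method names are hypothetical.

```python
class TemplateUpdater:
    """Tracks the best-scoring candidate template within each update interval."""

    def __init__(self, interval=200, threshold=0.5):
        self.interval = interval
        self.threshold = threshold
        self._best_score = None
        self._best_crop = None

    def observe(self, frame_idx, crop, score):
        """Record one frame; return a new template when the interval is reached."""
        reliable = score >= self.threshold
        if reliable and (self._best_score is None or score > self._best_score):
            self._best_score, self._best_crop = score, crop
        if frame_idx % self.interval != 0:
            return None
        chosen = self._best_crop  # None if no reliable candidate was seen
        self._best_score = self._best_crop = None
        return chosen


# Toy run with a short interval of 3 frames.
updater = TemplateUpdater(interval=3)
scores = [0.6, 0.9, 0.7, 0.2, 0.4, 0.1]
for idx, s in enumerate(scores, start=1):
    new_template = updater.observe(idx, f"crop{idx}", s)
    if new_template is not None:
        print(idx, new_template)  # 3 crop2
```

In the second interval of the toy run every score is below the threshold, so no update is issued and the previous online template would simply be kept.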
4.2 Comparison with the state-of-the-art trackers
We verify the performance of our proposed MixFormer-1k, MixFormer-22k, and MixFormer-L on five benchmarks, including VOT2020, LaSOT, TrackingNet, GOT10k, and UAV123.
VOT2020 consists of 60 videos with several challenges including fast motion, occlusion, etc. As shown in Table 1, MixFormer-L achieves the top-ranked EAO of 0.555, outperforming the transformer tracker STARK by a large margin of 5% EAO. MixFormer-22k also outperforms other trackers, including RPT (the VOT2020 short-term challenge winner).
LaSOT.
LaSOT has 280 videos in its test set. We evaluate our MixFormer on the test set to validate its long-term capability. Table 3 shows that our MixFormer surpasses all other trackers by a large margin. Specifically, MixFormer-L achieves the top-ranked NP of 79.9%, surpassing STARK by 2.9% even without multi-layer feature aggregation.
TrackingNet.
TrackingNet provides over 30K videos with more than 14 million dense bounding box annotations. The videos are sampled from YouTube, covering target categories and scenes in real life. We validate MixFormer on its test set. From Table 3, we find that our MixFormer-22k and MixFormer-L set a new state-of-the-art performance on this large-scale benchmark.
GOT10k.
GOT10k is a large-scale dataset with over 10000 video segments, 180 of which form the test set. Besides covering generic classes of moving objects and motion patterns, the object classes in the train and test sets have zero overlap. As shown in Table 3, our MixFormer-GOT obtains state-of-the-art performance on the test split.
UAV123.
UAV123 is a large dataset containing 123 sequences with an average sequence length of 915 frames, captured from low-altitude UAVs. Table 3 shows our results on the UAV123 dataset. Our MixFormer-22k and MixFormer-L outperform all other trackers.
4.3 Exploration Studies
To verify the effectiveness and give a thorough analysis on our proposed MixFormer, we perform a detailed ablation study on the large-scale LaSOT dataset.
As the core of our MixFormer is to unify the procedures of feature extraction and target information integration, we compare it to the separate processing architecture (e.g., TransT). The comparison results are shown in Table 4 #1, #2, #3 and #8. Experiments #1 and #2 are end-to-end trackers comprising a self-attention based backbone, cross attention modules (CAMs) to perform information integration, and a corner head. #3 is the tracker with CVT as backbone and TransT's ECA+CFA(4) as interaction. Experiment #8 is our MixFormer without multiple online templates and the asymmetric mechanism, denoted MixFormer-Base. MixFormer-Base improves over the models of #1 (using one CAM) and #2 (using three CAMs) by 8.6% and 7.9% respectively, with fewer parameters and FLOPs. This demonstrates the effectiveness of unified feature extraction and information integration, as the two benefit each other.
Study on stages of MAM.
To further verify the effectiveness of the MAMs, we conduct the experiments in Table 4 #4, #5, #6, #7 and #8 to investigate the performance of different numbers of MAMs in our MixFormer. We compare our MAM with self-attention modules (SAM) without cross-branch information communication. We find that more MAMs contribute to a higher AUC score. This indicates that extensive target-aware feature extraction and hierarchical information integration, realized by the iterative MAM, play a critical role in constructing an effective tracker. In particular, when the number of MAMs reaches 16, the performance reaches 68.1, which is comparable to MixFormer-Base containing 21 MAMs.
Study on localization head.
To verify the generalization ability of our MAM backbone, we evaluate MixFormer-Base with the two types of localization head described in Section 3.2 (fully convolutional head vs. query-based head). The results are shown in Table 4 #8 and #9 for the corner head and the query-based head respectively. MixFormer-Base with the fully convolutional corner head outperforms the variant with the query-based head. Notably, MixFormer-Base with the corner head surpasses all the other state-of-the-art trackers even without any post-processing or online templates. Besides, MixFormer-Base with the query head, a pure transformer-based tracking framework, obtains an AUC of 66.0, which is comparable with STARK-ST and KeepTrack and far exceeds the 63.7 of the query-head STARK-ST. This demonstrates the generalization ability of our MAM backbone.
Study on asymmetric MAM.
The asymmetric MAM is used to reduce computational cost and allows multiple templates to be used during online tracking. As shown in Table 5, the asymmetric MixFormer-Base increases the running speed by 24% while achieving comparable performance, which demonstrates that the asymmetric MAM is important for building an efficient tracker.
Study on online template update.
As demonstrated in Table 6, MixFormer with online templates, sampled by a fixed update interval, performs worse than that with only the first template, and the online MixFormer with our score prediction module achieves the best AUC score. It suggests that selecting reliable templates with our score prediction module is of vital importance.
Study on training and pre-training datasets.
To verify the generalization ability of our MixFormer, we conduct an analysis on different pre-training and training datasets, as shown in Table 7. MixFormer pretrained on ImageNet-1k still outperforms all the SOTA trackers (e.g., TransT, KeepTrack, STARK), even without post-processing and multi-layer feature aggregation. In addition, MixFormer trained only with GOT-10k also achieves an impressive AUC of 62.1, which outperforms a majority of trackers trained with the whole tracking datasets.
Visualization of attention maps.
To explore how the mixed attention works in MixFormer backbone, we visualize some attention maps in Fig. 5. From the four types of attention maps, we derive that: (i) distractors in background get suppressed layer by layer, (ii) online templates may be more adaptive to appearance variation and help to discriminate the target, (iii) the foreground of multiple templates can be augmented by mutual cross attention, (iv) a certain position tends to interact with the surrounding local patch.
Conclusion
We have presented MixFormer, an end-to-end tracking framework with iterative mixed attention, which unifies feature extraction and target integration and results in a neat and compact tracking pipeline. The mixed attention module performs both feature extraction and mutual interaction for the target template and search area. In empirical evaluation, MixFormer shows a notable improvement over other prevailing trackers for short-term tracking. In the future, we consider extending MixFormer to multiple object tracking.
Acknowledgement. This work is supported by National Natural Science Foundation of China (No.62076119, No.61921006), Program for Innovative Talents and Entrepreneur in Jiangsu Province, and Collaborative Innovation Center of Novel Software Technology and Industrialization.
Appendix
In this appendix, we first provide more results and analysis on OTB100 and LaSOT datasets. Then we give more visualization results of the attention weights on LaSOT. Finally, we provide more training details.
A. More Results
OTB100 is a commonly used benchmark, which evaluates performance on Precision and AUC scores. Fig. 7 presents the results of our trackers on both metrics on the OTB100 benchmark. MixFormer-L reaches competitive performance w.r.t. state-of-the-art trackers, surpassing the transformer tracker TransT by 1.3% on AUC score. Besides, MixFormer-L is slightly higher than MixFormer.
LaSOT.
LaSOT has 280 videos in its test set. We evaluate our MixFormer on the test set to validate its long-term capability. For further analysis, we provide the Success plot and Precision plot for LaSOT in Fig. 8. They indicate that the improvement is due to both higher accuracy and higher robustness.
B. More Visualization Results
In this section, we provide more visualization results of attention weights on car-2 of the LaSOT test set in Fig. 6. From this example, we arrive at the same conclusions as in Section 4.3. Besides, from the last two lines, we infer that the features of the last two blocks tend to adapt to the bounding box prediction head.
C. Training Details
We use a 320x320 search region plus two 128x128 input images to make a fair comparison with prevailing trackers (e.g., Siamese-based trackers, STARK and TransT). Generally, we use 8 Tesla V100 GPUs to train MixFormer with a batch size of 32. MixFormer can also be trained on 8 2080Ti GPUs with only 11GB memory each, with a batch size of 8 per GPU. We use CvT21 and CvT24-W as the pretrained models for MixFormer and MixFormer-L respectively. We apply gradient clipping with a clip norm of 0.1. For training stage-1 of MixFormer (i.e., MixFormer without SPM), we use GIoU loss and $L_1$ loss, with weights of 2.0 and 5.0 respectively. Besides, the Batch Normalization layers of the MixFormer backbone are frozen during the whole training process. For the SPM training process, the backbone and corner-based localization head are frozen and the batch size is 32. SPM is trained for 40 epochs, with the learning rate decayed at epoch 30.