TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers

Qianyu Zhou, Xiangtai Li, Lu He, Yibo Yang, Guangliang Cheng, Yunhai Tong, Lizhuang Ma, Dacheng Tao

Introduction

Video Object Detection (VOD) extends image object detection to video, aiming to detect every object in a given video clip. It enables various real-world applications, e.g., autonomous driving. However, still-image detectors cannot be directly applied to challenging video data because of the appearance deterioration and changes across video frames, e.g., motion blur, part occlusion, camera refocus, and rare poses.

Previous VOD methods mainly leverage temporal information in two different ways. The first relies on post-processing of temporal information to make the detection results more coherent and stable. These methods usually apply a still-image detector to obtain detection results and then associate the results across frames. Another line of approaches exploits feature aggregation of temporal information. Specifically, they improve the features of the current frame by aggregating those of adjacent frames or entire clips, boosting detection performance via specific operator designs. In this way, problems such as motion blur, part occlusion, and fast appearance change can be well handled. In particular, most methods use the two-stage detectors Faster-RCNN or R-FCN as the still-image baseline.

Despite the success of these approaches, most two-stage pipelines for video object detection are overly sophisticated and require many hand-crafted components, e.g., optical flow models, recurrent neural networks, deformable convolution fusion, and relation networks. In addition, most of them need complicated post-processing that links the same object across the video to form tubelets and aggregates classification scores within the tubelets to achieve state-of-the-art performance. Meanwhile, several studies focus on real-time video object detection, but these works still need sophisticated designs. Thus, a simple yet effective VOD framework that works in a fully end-to-end manner is needed.

Transformers have shown promising potential in computer vision. In particular, DETR simplifies the detection pipeline by modeling object queries and achieves performance comparable to highly optimized CNN-based detectors. However, such static detectors cannot handle motion blur, part occlusion, video defocus, or rare poses well because they lack temporal information, as will be shown in the experiment part. Thus, how to model temporal information in a long-range video clip is a critical problem.

In this paper, our goal is to extend DETR-like object detectors to the video object detection domain. Our insights cover four aspects. Firstly, we observe that a video clip contains rich inherent temporal information, e.g., rich visual cues of motion patterns. Thus, it is natural to view video object detection as a sequence-to-sequence task that benefits from the advantages of Transformers. The whole video clip is like a sentence, and each frame plays a role similar to a word in natural language processing. Transformers can be used not only within each frame to model the interactions between objects, but also to link objects along the temporal dimension. Secondly, the object query is a key component of DETR that encodes instance-aware information. The learning process of DETR can be seen as a grouping process: each object is grouped into an object query. Thus, these query embeddings can represent the instances of each frame, and it is natural to link these sparse query embeddings via another temporal Transformer. Thirdly, the output memory of the DETR Transformer encoder contains rich spatial information, which can also be modeled jointly with the query embeddings along the temporal dimension. Fourthly, feeding clip-level inputs to Transformers can speed up object detection in a video, which is needed in many real-world applications.

Motivated by the above observations, we propose TransVOD, a novel end-to-end video object detection framework based on a spatial-temporal Transformer architecture. TransVOD views video object detection as an end-to-end sequence decoding/prediction problem. As shown in Fig. 2(a), it takes multiple frames as inputs and directly outputs the detection results of the current frame via a Transformer-like architecture. In particular, we design a novel temporal Transformer to link the object queries and the memory encoding outputs simultaneously. The temporal Transformer contains three main components: a Temporal Deformable Transformer Encoder (TDTE) to encode the spatial details of multiple frames, a Temporal Query Encoder (TQE) to fuse object queries within one video clip, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the final detection results of the current frame. TDTE efficiently aggregates spatial information via temporal deformable attention and avoids background noise. TQE first adopts a coarse-to-fine strategy to select relevant object queries in one clip and then fuses the selected queries via several self-attention layers. TDTD is another decoder that takes the outputs of TDTE and TQE as inputs and directly outputs the final detection results. These modules are shared across frames and can be trained in a fully end-to-end manner. We carry out extensive experiments on the ImageNet VID dataset. Compared with the single-frame baseline, TransVOD achieves significant improvements (2% $\sim$ 4% mAP).

Based on the TransVOD framework, which was published at ACM MM 2021, we present two improved versions: TransVOD++ and TransVOD Lite. For TransVOD++, given the large redundancy in both the number of object queries and the number of targets, we present a hard query mining (HQM) strategy that samples the hardest queries during training, inspired by hard pixel mining in image object detection and segmentation, as shown in Fig. 2(b). Moreover, we present a novel query and RoI fusion (QRF) module based on dynamic convolutions. In this way, object-level appearance information is injected into each query, and TDTE can be removed because its spatial fusion is replaced by QRF. Compared with the previous TransVOD, both improvements lead to better results at faster speed. Moreover, when deploying a vision Transformer backbone, we present a simply-aligned fusion to fuse multi-scale features for TDTD. With Swin base as the backbone, TransVOD++ achieves 90% mAP on the ImageNet VID dataset and surpasses previous works by a significant margin (5% $\sim$ 6%) with a simpler pipeline. Our method is the first to achieve 90% mAP on the ImageNet VID dataset.

Building on TransVOD, we present TransVOD Lite, which targets real-time VOD and models the task as a sequence-to-sequence prediction problem, as is done in machine translation. The pipeline is shown in Fig. 2(c). In particular, given a window size $T$ ($T$ can be 8 or 16), we take multiple frames as inputs and obtain the results of multiple frames simultaneously. The results for a whole video clip are then obtained window by window. In this way, we can fully use the GPU memory to speed up inference. TransVOD Lite boosts the single-image baseline by 2% $\sim$ 3% mAP while running faster (4x-6x). After adopting the Swin Transformer, as shown in Fig. 1, our methods achieve the best speed and accuracy trade-off. Our methods lead previous VOD methods by a significant margin (3% $\sim$ 4% mAP, 5 $\sim$ 15 FPS) in both speed and accuracy. Our best model achieves 83.7% mAP while running at around 30 FPS. In summary, following the TransVOD framework, we present TransVOD++ and TransVOD Lite. Both models set new state-of-the-art results on the challenging ImageNet VID dataset in two different settings: accuracy for non-real-time models and the best speed-accuracy trade-off for real-time models. These results indicate that our method can serve as a new solid baseline for VOD.

Related work

Video Object Detection. The VOD task requires detecting objects in each frame and linking the same objects across frames. State-of-the-art methods typically develop sophisticated pipelines to tackle it. In general, work on VOD can be divided into two directions: improving detection accuracy via temporal fusion, and performing real-time video object detection while maintaining accuracy.

For the first aspect, most previous works rely on feature aggregation, which enhances per-frame features by aggregating the features of nearby frames. Earlier works adopt flow-based warping to achieve feature aggregation. Specifically, FGFA and THP both utilize the optical flow from FlowNet to model the motion relation via different temporal feature aggregation strategies. To calibrate the pixel-level features under inaccurate flow estimation, MANet dynamically combines pixel-level and instance-level calibration according to the motion. Nevertheless, these flow-warping-based methods have several disadvantages: 1) training a model for flow extraction requires large amounts of flow data, which may be difficult and costly to obtain; 2) integrating a flow network and a detection network into a single model may be challenging due to multitask learning. Another line of attention-based approaches utilizes self-attention and non-local operations to capture long-range dependencies of temporal contexts. SELSA treats video as a bag of unordered frames and proposes to aggregate features at the full-sequence level. STSN and TCENet utilize deformable convolution to aggregate the temporal contexts within a complicated framework with many heuristic designs. RDN introduces a new design to capture the interactions across objects in the spatial-temporal context. LWDN adopts a memory mechanism to propagate and update the memory feature from key frames to key frames. OGEMN uses an object-guided external memory to store pixel-level and instance-level features for further global aggregation. MEGA aggregates both the global and the local information from the video and presents a long-range memory. Despite the great success of these approaches, most pipelines for VOD are too sophisticated and require many hand-crafted components, e.g., an extra optical flow model, a memory mechanism, or a recurrent neural network.

In addition, most of them need complicated post-processing methods such as Seq-NMS, Tubelet rescoring, Seq-Bbox Matching, or REPP, which link the same object across the video to form tubelets and aggregate classification scores within the tubelets to achieve state-of-the-art results. Instead, our previous work TransVOD builds a simple and end-to-end trainable VOD framework without these designs. Beyond that, our improved version TransVOD++ incorporates more appearance information into the object query design and simplifies the whole pipeline by removing the temporal encoder (TDTE) of the original TransVOD. It achieves better results than TransVOD and state-of-the-art performance on the ImageNet VID dataset.

For the second aspect, starting from DFF, several works focus on real-time video object detection while keeping accuracy unchanged or even improving it. In general, most of these works also rely on specific architecture designs with many hand-crafted components and human priors, such as object-level trackers, patchwork cells with attention, and convolutional LSTMs. Our proposed TransVOD Lite models the entire VOD pipeline as a sequence-to-sequence problem, as the Transformer does in machine translation. It achieves significant improvements over the strong image baseline along with a faster speed.

Vision Transformers. Recently, vision Transformers have made great progress. Existing work can be mainly divided into two directions: replacing the CNN backbone with a Transformer-like architecture, and using object queries to represent instances for scene understanding. Our work is related to the second direction. DETR builds a fully end-to-end object detection system based on Transformers, which largely simplifies the traditional detection pipeline. It also achieves performance on par with highly-optimized CNN-based detectors. However, it suffers from slow convergence and limited feature spatial resolution. Deformable DETR improves DETR by designing a deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. Our work is inspired by DETR and Deformable DETR. The above works show the effectiveness of Transformers in image object detection. Several concurrent works apply Transformers to video understanding, e.g., video instance segmentation (VIS) and multi-object tracking (MOT). TransTrack introduces a query-key mechanism into the multi-object tracking model, while Trackformer directly adds track queries for MOT. However, both leverage only limited temporal information, i.e., just the previous frame, which we believe cannot fully exploit the temporal contexts of a video clip. VisTR views the VIS task as a direct end-to-end parallel sequence prediction problem. The targets of a clip are disrupted in such an instance sequence, and directly performing target assignment is not optimal. Instead, we aim to link the outputs of the spatial Transformer, i.e., object queries, through a temporal Transformer, which acts in a completely different way from VisTR. To our knowledge, there are no prior applications of Transformers to video object detection (VOD) so far.

The Transformer's ability to model long-range dependencies makes it a natural fit for learning temporal contexts across multiple frames in VOD. Our previous work, TransVOD, leverages both the spatial Transformer and the temporal Transformer and confirms this intuition. In this paper, based on the TransVOD framework, we provide two further solutions: TransVOD++ and TransVOD Lite. The former aims to improve the performance of TransVOD while keeping inference efficient, while the latter carries out real-time VOD with much faster inference speed.

Method

Overview. We first review the previous works, including both DETR and Deformable DETR, in Sec. 3.1. Then, we give detailed descriptions of our proposed TransVOD framework in Sec. 3.2. It contains three key components: Temporal Deformable Transformer Encoder (TDTE), Temporal Query Encoder (TQE), and Temporal Deformable Transformer Decoder (TDTD). Then, we present two advanced versions of TransVOD, namely TransVOD++ (Sec. 3.3) and TransVOD Lite (Sec. 3.4). Finally, we describe the loss functions and details of inference in Sec. 3.5.

The multi-head attention used in DETR can be written as:

$$\operatorname{MultiHeadAttn}(z_{q},x)=\sum_{m=1}^{M}W_{m}\Big[\sum_{k\in\Omega_{k}}A_{mqk}\cdot W^{\prime}_{m}x_{k}\Big],$$

where $q$ indexes a query element with feature $z_{q}$, $k$ indexes a key element with feature $x_{k}$, $m$ indexes the attention head, and $W^{\prime}_{m}\in R^{C_{v}\times C}$ and $W_{m}\in R^{C\times C_{v}}$ are learnable weights ($C_{v}=C/M$ by default). The attention weights $A_{mqk}$ are normalized as:

$$A_{mqk}\propto\exp\Big\{\frac{z_{q}^{T}U_{m}^{T}V_{m}x_{k}}{\sqrt{C_{v}}}\Big\},\qquad\sum_{k\in\Omega_{k}}A_{mqk}=1,$$

where $U_{m},V_{m}\in R^{C_{v}\times C}$ are learnable weights. In practice, the features $z_{q}$ and $x_{k}$ are the concatenation/summation of element contents and positional embeddings. The decoder's output features for each object query are further transformed by a Feed-Forward Network (FFN) to output the class score and box location of each object. Given the box and class predictions, the Hungarian algorithm is applied between predictions and ground-truth box annotations to identify the learning target of each object query through one-to-one matching. Deformable DETR replaces the multi-head self-attention layer with a deformable attention layer that efficiently samples local pixels rather than all pixels. Moreover, to better detect small objects, it also introduces a cross-attention module that incorporates multi-scale feature representations. Owing to its fast convergence and computational efficiency, we adopt Deformable DETR as our still-image Transformer detector.
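The one-to-one matching between predictions and ground truths can be illustrated with a small sketch. The function below is a brute-force stand-in for the Hungarian algorithm: it enumerates all assignments, so it is only practical for tiny inputs. The name `match_one_to_one` and the cost values are hypothetical and are not taken from the paper.

```python
from itertools import permutations

def match_one_to_one(cost):
    """Find the cheapest one-to-one assignment of predictions to ground truths.

    cost[i][j] is the (hypothetical) matching cost between prediction i and
    ground truth j. Assumes there are at least as many predictions as ground
    truths, as is the case for DETR-style object queries.
    """
    num_pred, num_gt = len(cost), len(cost[0])
    best_ids, best_total = None, float("inf")
    # Each candidate picks a distinct prediction for every ground truth.
    for pred_ids in permutations(range(num_pred), num_gt):
        total = sum(cost[p][g] for g, p in enumerate(pred_ids))
        if total < best_total:
            best_ids, best_total = pred_ids, total
    pairs = [(p, g) for g, p in enumerate(best_ids)]
    return pairs, best_total

# Three predictions, two ground truths; unmatched predictions become "no object".
example_cost = [[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]]
pairs, total = match_one_to_one(example_cost)
print(pairs, total)
```

In practice an exact polynomial-time solver would replace the enumeration; the sketch only shows what the matching is expected to return.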

3.2 TransVOD Framework

The overall TransVOD architecture is shown in Fig. 3. It takes multiple frames of a video clip as inputs and outputs the detection results for the current frame. It contains four main components: Spatial Transformers for single-frame object detection, which extract both object queries and a compact feature representation (the memory of each frame); a Temporal Deformable Transformer Encoder (TDTE) to fuse the memory outputs of the Spatial Transformers; a Temporal Query Encoder (TQE) to link objects in each frame along the temporal dimension; and a Temporal Deformable Transformer Decoder (TDTD) to obtain the final outputs for the current frame.

Spatial Transformer. We use Deformable DETR as our still-image detector. In particular, to simplify its design, we do not use multi-scale features in either the Transformer encoder or the decoder. We only use the last stage of the backbone as the input of the deformable Transformer. The modified detector includes a Spatial Transformer Encoder (STE) and a Spatial Transformer Decoder (STD), which encode each frame $F$ (including the Reference Frames and the Current Frame) into two compact representations: the spatial object query $Q$ and the memory encoding $E$.

Temporal Deformable Transformer Encoder. The goal of TDTE is to encode spatial-temporal feature representations and provide location cues for the final decoder output. Since most adjacent frames contain similar appearance information, directly using naive Transformer encoders would add substantial computation, much of it spent on background regions. Deformable attention instead samples only partial information, efficiently, according to a learned offset field. Thus, we can link the memory encodings $E_{t}$ along the temporal dimension through this operation. The core idea of the temporal deformable attention module is to attend only to a small set of key sampling points around a reference point. TDTE therefore receives the feature memories of the reference frames and the current frame as inputs and outputs the enhanced current memory. The multi-head temporal deformable attention (TempDeformAttn) is as follows:

$$\operatorname{TempDeformAttn}(z_{q},\hat{p}_{q},\{x^{l}\}_{l=1}^{L})=\sum_{m=1}^{M}W_{m}\Big[\sum_{l=1}^{L}\sum_{k=1}^{K}A_{mlqk}\cdot W^{\prime}_{m}x^{l}\big(\phi_{l}(\hat{p}_{q})+\Delta p_{mlqk}\big)\Big],$$

where $m$ indexes the attention head, $l$ indexes the frame sampled from the same video clip, and $k$ indexes the sampling points. $\Delta p_{mlqk}$ and $A_{mlqk}$ denote the sampling offset and attention weight of the $k^{\text{th}}$ sampling point in the $l^{\text{th}}$ frame and the $m^{\text{th}}$ attention head, respectively. $A_{mlqk}$ is a scalar attention weight in the range $[0,1]$, normalized by $\sum_{l=1}^{L}\sum_{k=1}^{K}A_{mlqk}=1$. $\Delta p_{mlqk}\in R^{2}$ is a 2-d real-valued vector with unconstrained range. Since $p_{q}+\Delta p_{mlqk}$ is fractional, we apply bilinear interpolation to compute $x(p_{q}+\Delta p_{mlqk})$. For each frame $l$, both $\Delta p_{mlqk}$ and $A_{mlqk}$ are calculated by feeding the query feature $z_{q}$ to a linear projection of $3MK$ channels, where the first $2MK$ channels encode the sampling offsets $\Delta p_{mlqk}$ and the remaining $MK$ channels are fed to a $\operatorname{Softmax}$ function to obtain the attention weights $A_{mlqk}$. Here, we use normalized coordinates $\hat{p}_{q}\in[0,1]^{2}$ for clarity of the scale formulation, in which $(0,0)$ and $(1,1)$ indicate the top-left and the bottom-right image corners, respectively. $\phi_{l}(\hat{p}_{q})$ re-scales the normalized coordinates $\hat{p}_{q}$ to the input feature map of the $l$-th frame. The multi-frame temporal deformable attention samples $LK$ points from $L$ feature maps instead of $K$ points from a single-frame feature map. There are $M$ attention heads in total in each TDTE layer.
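The sampling pattern of temporal deformable attention can be sketched in a few lines. The code below is a minimal illustration for a single head and a single channel, so the projections $W_{m}$ and $W^{\prime}_{m}$ reduce to the identity; it is not the authors' implementation. The function names, the border clamping, and the mapping of normalized coordinates to pixel positions are assumptions made for the example.

```python
import math

def bilinear(feat, x, y):
    """Bilinearly sample feat[row][col] at fractional (x, y); clamps to the border (assumption)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_deform_attn(frames, ref_point, offsets, logits):
    """Single-head, single-channel temporal deformable attention.

    frames:     L feature maps, each a 2-D list
    ref_point:  normalized (x, y) in [0, 1]^2
    offsets:    offsets[l][k] = (dx, dy) in feature-map pixels
    logits:     logits[l][k], unnormalized attention scores
    """
    # Attention weights are normalized jointly over all L*K sampling points.
    weights = softmax([v for row in logits for v in row])
    out, idx = 0.0, 0
    for l, feat in enumerate(frames):
        h, w = len(feat), len(feat[0])
        # phi_l: rescale the normalized reference point to frame l (assumed convention).
        px, py = ref_point[0] * (w - 1), ref_point[1] * (h - 1)
        for dx, dy in offsets[l]:
            out += weights[idx] * bilinear(feat, px + dx, py + dy)
            idx += 1
    return out

current = [[0.0, 1.0], [2.0, 3.0]]
reference = [[4.0, 4.0], [4.0, 4.0]]
value = temporal_deform_attn([current, reference], (0.5, 0.5),
                             [[(0.0, 0.0)], [(0.0, 0.0)]], [[0.0], [0.0]])
print(value)
```

With $L$ frames and $K$ offsets per frame, the loop touches only $LK$ locations, which is the source of the savings over dense attention.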

Temporal Query Encoder. As mentioned in the previous part, learnable object queries can be regarded as non-geometric anchors that automatically learn the statistical features of the whole still-image dataset during training. This means that the spatial object queries are not related to the temporal contexts across different frames. Thus, we propose a simple yet effective encoder to measure the interactions between the objects in the current frame and the objects in the reference frames.

Our key idea is to link these spatial object queries in each frame via a temporal Transformer and thus learn the temporal contexts across different frames. We name this module the Temporal Query Encoder (TQE). TQE takes all the spatial queries from the reference frames to enhance the spatial output query of the current frame, and it outputs the temporal query for the current frame. Moreover, inspired by prior work, we design a coarse-to-fine spatial object query aggregation strategy to progressively schedule the interactions between the current object query and the reference object queries. The benefit of such a coarse-to-fine design is that it reduces the computation cost to some extent.

Specifically, we combine the spatial object queries from all reference frames, denoted as $Q_{ref}$, and then perform scoring and selection in a coarse-to-fine manner. In particular, we use an extra Feed-Forward Network (FFN) to predict the class logits and take their sigmoid: $p=\operatorname{Sigmoid}[\operatorname{FFN}(Q_{ref})]$. We then sort all reference queries by $p$ and select the top-$k$ most confident ones. A higher $p$ indicates that a query more likely corresponds to an object. This prediction head is trained jointly with the spatial Transformers under the classification loss for image object detection, and its parameters are fixed when fine-tuning the temporal Transformers for video object detection. As most current DETR-like detectors use cascaded heads to refine detection results, we adopt a similar coarse-to-fine design that selects fewer but more precise object queries in the later stages, since most queries are unused or duplicated there.
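The scoring-and-selection step can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes that each query's confidence is the maximum sigmoid probability over classes, and the helper name `select_top_k_queries` and the example logits are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def select_top_k_queries(class_logits, k):
    """Return the indices of the k most confident queries, most confident first.

    class_logits[i] holds the per-class logits predicted for query i.
    Assumption: a query's confidence is its maximum class probability.
    """
    scores = [max(sigmoid(v) for v in logits) for logits in class_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Four reference queries, two classes each.
logits = [[-2.0, 0.5], [3.0, -1.0], [0.0, 0.1], [-4.0, -3.0]]
print(select_top_k_queries(logits, 2))
```

A coarse-to-fine schedule would call this with a decreasing `k` before each successive encoder stage.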

As shown in the blue part of Fig. 3, TQE includes a self-attention layer, a cross-attention layer, and an FFN. The temporal object queries are progressively refined through interaction with the spatial object queries extracted from different frames, by computing the co-attention between the reference queries and the query feature of the current frame. Note that the cross-attention plays the role of a cascade feature refiner, which iteratively updates the output queries of each spatial Transformer. As such, TQE receives the object queries of the reference frames and the current frame as inputs and outputs the refined temporal object query of the current frame.

Temporal Deformable Transformer Decoder. This decoder aims to obtain the current-frame output from the outputs of both TDTE (fused memory encodings) and TQE (temporal object queries). Given the aggregated feature memories $\hat{E}$ and the temporal queries $\hat{O}_{q}$, the Temporal Deformable Transformer Decoder (TDTD) performs co-attention between the online queries and the temporally aggregated features. The deformable co-attention of the temporal decoder layer is as follows:

$$\operatorname{DeformAttn}(z_{q},p_{q},x)=\sum_{m=1}^{M}W_{m}\Big[\sum_{k=1}^{K}A_{mqk}\cdot W^{\prime}_{m}x(p_{q}+\Delta p_{mqk})\Big],$$

where $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total number of sampled keys ($K\ll HW$). $\Delta p_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k^{\text{th}}$ sampling point in the $m^{\text{th}}$ attention head, respectively. The attention weight $A_{mqk}\in[0,1]$ is normalized by $\sum_{k=1}^{K}A_{mqk}=1$. $\Delta p_{mqk}\in R^{2}$ is a 2-d real-valued vector with unconstrained range. Because $p_{q}+\Delta p_{mqk}$ is fractional, we also adopt bilinear interpolation to compute $x(p_{q}+\Delta p_{mqk})$. Both $\Delta p_{mqk}$ and $A_{mqk}$ are obtained via a linear projection of the query feature $z_{q}$. The output of TDTD is sent to a feed-forward network (FFN) for the final classification and box regression, which gives the detection results of the current frame.

3.3 TransVOD++

Although TransVOD simplifies the VOD pipeline compared with previous work, it has several limitations. Firstly, TDTE incurs heavy computation costs. Secondly, the performance of TransVOD is still limited. To solve these problems, we present TransVOD++, which contains the following improvements: Query and RoI Fusion (QRF), Hard Query Mining (HQM), and a strong backbone. The pipeline is shown in Fig. 4.

Query and RoI Fusion. Previous works show that region features are useful and contain precise appearance information for temporal fusion. Our motivation is to replace TDTE with region-of-interest (RoI) features via a proxy strategy in which each RoI feature is injected into the corresponding query, thus using object-level appearance information to enhance the object query.

Specifically, given the detection boxes from the spatial Transformers, we obtain the regions of interest (RoIs) of each frame in a video clip. Then, according to those RoIs and the features from the STE, we calculate the RoI features $E^{RoI}_{cur}$ and $E^{RoI}_{ref}$ of the current frame and the reference frames, respectively. Next, the cropped RoI features are used to weigh each query through an MLP transformation, as shown in the green part of Fig. 4. The current RoI feature $E^{RoI}_{cur}$ is aggregated onto the object query of the current frame to generate the enhanced current query feature $\hat{Q}_{cur}$, where feature aggregation is conducted through dynamic convolutions:

$$\hat{Q}^{j}_{cur}=\operatorname{QRF}(Q^{j}_{cur},E^{RoI}_{cur}),$$

where $Q^{j}_{cur}$ is the spatial object query of the current frame before the $j^{\text{th}}$ temporal query encoder (TQE), and $\hat{Q}^{j}_{cur}$ denotes the temporal object query before the $j^{\text{th}}$ TQE module. Similarly, for each reference frame, the reference RoI features $E^{RoI}_{ref}$ of the $i^{\text{th}}$ frame are fused with the reference query of the $i^{\text{th}}$ frame via QRF.

The details of the QRF module are as follows. Given the object queries and the RoI feature memory, we first feed the object queries to a multi-head self-attention layer to reason about the relations between objects. Then, each RoI feature interacts with the corresponding object query to filter out ineffective bins and output the final object query. Inspired by prior work, we use two consecutive $1\times 1$ convolutions with ReLU activations to keep the design lightweight. The $k^{\text{th}}$ object query generates the dynamic parameters of these two convolutions for the corresponding $k^{\text{th}}$ RoI feature via a linear projection. Finally, the aggregated reference queries $\hat{Q}^{j}_{ref}$ are used to enhance the aggregated current query $\hat{Q}^{j}_{cur}$ via a TQE, thus learning the temporal contexts across different frames: $\hat{Q}^{j}_{cur}=\text{TQE}(\hat{Q}^{j}_{cur},\hat{Q}^{j}_{ref})$.
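The dynamic-convolution interaction between one query and its RoI feature can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the self-attention step is omitted, the parameter layout of the projection `proj` and the hidden width `hidden` are hypothetical, and the final average over RoI bins is an assumed way to reduce the result back to a query vector.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def query_roi_fusion(query, roi_feat, proj, hidden):
    """Fuse one object query (length C) with its RoI feature (S bins x C channels).

    proj is a C x (2 * C * hidden) matrix; the query is projected through it to
    generate the weights of two 1x1 convolutions (hypothetical layout).
    """
    c = len(query)
    params = matmul([query], proj)[0]                     # dynamic parameters
    w1 = [params[i * hidden:(i + 1) * hidden] for i in range(c)]               # C x hidden
    base = c * hidden
    w2 = [params[base + i * c:base + (i + 1) * c] for i in range(hidden)]      # hidden x C
    x = relu(matmul(roi_feat, w1))                        # first 1x1 conv + ReLU
    x = relu(matmul(x, w2))                               # second 1x1 conv + ReLU
    bins = len(x)
    # Assumption: average over RoI bins to obtain the enhanced query.
    return [sum(row[ch] for row in x) / bins for ch in range(c)]

query = [1.0, 2.0]
roi = [[1.0, 1.0], [0.0, -1.0]]
proj = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
print(query_roi_fusion(query, roi, proj, hidden=1))
```

Because the convolution weights are produced from the query itself, each query filters its own RoI feature differently, which is the point of using dynamic rather than shared convolutions.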

Hard Query Mining. Both the spatial and the temporal object queries contain much redundant information across the dataset. For example, 300 queries reflect the temporal appearance distributions of 30 categories, and those queries need to match more than 300 ground truths during training. There is no need to maintain so many object queries/targets in both the spatial and temporal dimensions. We are therefore motivated to dynamically reduce the redundancy in the number of queries and targets when training the temporal Transformers, while mining the hardest queries in both the current frame and the reference frames.

Concretely, given the spatial object queries $Q_{ref}$ of the reference frames, we feed them into a Query Filter Head (QFH), which filters out redundant object queries and selects the most confident ones to reduce computation. Specifically, $Q_{ref}$ is fed forward to the class embedding layer of the spatial Transformer, i.e., a linear classification layer with sigmoid activation, to generate class logits. Then, the reference queries are concatenated along the query-number dimension. Next, according to the probabilities of the reference logits, we sort the queries and select the top $k$ most confident ones, as illustrated in the salmon part of Fig. 4. As in TransVOD, we adopt the coarse-to-fine query aggregation strategy to progressively model the relationships between the current query and the reference queries via the TQE module.

TransVOD++ differs from TransVOD in several aspects. Firstly, in contrast to TransVOD, which only selects reference queries, TransVOD++ selects both the reference queries and the current queries. The two are treated differently in a coarse-to-fine manner, thus reducing the computation cost of the temporal Transformer. Secondly, compared to TransVOD, we add a Temporal Deformable Transformer Decoder (TDTD) after each TQE module and supervise the object queries with different query numbers via an auxiliary TDTD loss, denoted as $\mathcal{L}_{aux}$. We find the auxiliary TDTD losses $\mathcal{L}_{aux}$ helpful when training the temporal Transformer, especially in helping the model output the correct number of objects of each class. We add prediction FFNs and a Hungarian loss after each TDTD module. All prediction FFNs share their parameters.

Strong Backbone. We further adopt Swin Transformer as a strong backbone network. However, Swin Transformer generates multi-scale features intended for an FPN-like framework, which is not suitable for our TransVOD framework. We propose a simple yet effective solution that fuses the multi-scale features into a single scale by directly adding them together.

3.4 TransVOD Lite

Although TransVOD and TransVOD++ make the VOD pipeline much simpler, their inference speed is still limited by the fusion of queries from multiple frames. As mentioned in Section 2, inference time is critical for real-world applications. To exploit the Transformer's strength in modeling sequence data, we present TransVOD Lite, which takes multiple frames as inputs and directly outputs the detection results of all frames, as shown in Fig. 5.

Direct Multiple Frame Predictions. In TransVOD Lite, we abandon the feature aggregation paradigm, which incurs much higher costs in both time and memory. Instead, a sequence of frames from a video clip is fed as input, and a sequence of results is output. As shown in Fig. 5, TransVOD Lite inherits Hard Query Mining from TransVOD++ and the spatial-temporal Transformer design of TransVOD, including TQE and TDTD. The main difference is that TransVOD Lite directly outputs predictions for multiple frames, controlled by a hyper-parameter $T_{w}$, which is the temporal window size of the input clip, i.e., the number of input frames. A larger $T_{w}$ gives faster inference at the cost of more memory. In this way, we can fully use the GPU memory to speed up inference. We provide detailed experiments on the effect of $T_{w}$ in the experiment part.

Sequential Hard Query Mining. Different from TransVOD and TransVOD++, we do not need to distinguish whether an object query is a reference query or a current query when filtering. All object queries in the whole sequence are selected on an equal footing in a coarse-to-fine manner, which increases the speed (e.g., FPS) of the temporal Transformer to $T_{w}$ times that of the original TransVOD, where $T_{w}$ denotes the temporal window size of a given clip. We name this method "sequential hard query mining" (SeqHQM). For example, $T_{w}=12$ means that the video clip contains 12 input frames and that we need to generate results for those 12 frames. If each frame has 300 object queries, there are 3600 object queries in total. Such a large number of object queries contains substantial redundancy, so it is necessary to dynamically reduce the computation cost to boost inference speed while still modeling temporal motion well.

We then describe SeqHQM in detail. Specifically, a sequence of spatial object queries $Q_{seq}$ is fed into a Query Filter Head (QFH) to select the most credible object queries. The number of object queries and targets is dynamically decreased to reduce redundant computation. For TransVOD Lite, we implement the QFH differently depending on the $k^{\text{th}}$ TQE module it precedes. If $k=1$, we use the class embedding layer of the spatial Transformer to generate class logits, followed by a sigmoid activation function, which is similar to the QFH in TransVOD++. If $k>1$, the class logits are generated by a learnable temporal class embedding layer, again followed by a sigmoid activation function. Next, we compute the maximum probability of each query and select the top $k$ most confident queries by sorting and selection in a coarse-to-fine manner, as illustrated in the green part of Fig. 5. Similar to TransVOD++, we add a TDTD after each TQE module and supervise the object queries with different query numbers via an auxiliary TDTD loss, denoted as $\mathcal{L}_{aux}$. $\mathcal{L}_{aux}$ is essential in helping the model output the correct number of objects of each class. We add prediction FFNs and a Hungarian loss after each TDTD module. All prediction FFNs share their parameters.
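Sequence-level selection can be illustrated with a short sketch. The code pools the confidence scores of every query in every frame of a window and keeps the most confident ones regardless of which frame they come from. The function name `pool_and_select` and the scores are hypothetical, and the confidences are assumed to be already computed (e.g., as maximum class probabilities).

```python
def pool_and_select(frame_scores, k):
    """Select the k most confident queries across a whole clip.

    frame_scores[t][q] is the confidence of query q in frame t.
    Returns (frame index, query index) pairs, most confident first.
    """
    pooled = [(score, t, q)
              for t, scores in enumerate(frame_scores)
              for q, score in enumerate(scores)]
    # Sort by score only; ties keep their original frame/query order.
    pooled.sort(key=lambda item: item[0], reverse=True)
    return [(t, q) for _, t, q in pooled[:k]]

# A window of 12 frames with 300 queries each gives 3600 pooled queries.
window = [[0.0] * 300 for _ in range(12)]
print(len(pool_and_select(window, 3600)))

# Small example: three frames, two queries per frame.
print(pool_and_select([[0.1, 0.9], [0.8, 0.3], [0.7, 0.2]], 3))
```

Keeping the frame index alongside each selected query allows the later prediction heads to assign every output back to its frame.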

3.5 Loss Functions and Inference

Loss functions. The original DETR avoids post-processing and adopts a one-to-one label assignment rule. We match the predictions from the STD/TDTD with the ground truth using the Hungarian algorithm, so the entire training process of the spatial Transformer is the same as in the original DETR. The temporal Transformer uses similar loss functions, given the box and class predictions output by the FFN. The matching cost is defined by the same terms as the loss function. The loss function is:
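Written out with the symbols defined next, and assuming that the three weighted terms are simply summed over the outputs of the $J$ TDTD modules, the objective takes the following form, where the superscript $(j)$ is notation introduced here to index the TDTD module:

$$\mathcal{L}=\sum_{j=1}^{J}\left(\lambda_{cls}\,\mathcal{L}_{\mathit{cls}}^{(j)}+\lambda_{L1}\,\mathcal{L}_{\mathit{L1}}^{(j)}+\lambda_{giou}\,\mathcal{L}_{\mathit{giou}}^{(j)}\right)$$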

where $J$ denotes the total number of TDTD modules in the temporal Transformers, with $J=1$ for TransVOD and $J=3$ for TransVOD++ and TransVOD Lite in all experiments. $\mathcal{L}_{\mathit{cls}}$ represents the focal loss for classification. $\mathcal{L}_{\mathit{L1}}$ and $\mathcal{L}_{\mathit{giou}}$ represent the L1 loss and the generalized IoU loss for localization. $\lambda_{cls}$, $\lambda_{L1}$ and $\lambda_{giou}$ are their coefficients. We balance these loss functions using the same setting as prior work. For TransVOD Lite, we apply this loss function to all input frames.
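As a rough illustration of the localization terms, the sketch below computes the generalized IoU of two boxes and combines per-module loss values with the three coefficients. The helper names and the coefficient values (2, 5, 2) are placeholders chosen for the example, not values stated in this paper, and boxes are given as corner coordinates `(x1, y1, x2, y2)` with positive area for simplicity.

```python
def giou(a, b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(pred, gt, lam_l1=5.0, lam_giou=2.0):
    """Weighted L1 + GIoU loss for one matched pair (weights are placeholders)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return lam_l1 * l1 + lam_giou * (1.0 - giou(pred, gt))


def total_loss(per_module, lam_cls=2.0, lam_l1=5.0, lam_giou=2.0):
    """Sum weighted (cls, l1, giou) loss values over the J TDTD modules."""
    return sum(lam_cls * c + lam_l1 * l + lam_giou * g for c, l, g in per_module)
```

For instance, `giou((0, 0, 1, 1), (2, 0, 3, 1))` is negative because the boxes are disjoint, which is what lets the GIoU loss still provide a gradient signal when the plain IoU is zero.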

Inference for TransVOD Lite. In TransVOD Lite, the window size of a given video is denoted as $T_{w}$ and the interval between two adjacent frames within one clip is denoted as $I_{w}$. Given a video $V=\{F_{1},F_{2},\cdots,F_{N}\}$, we first expand the number of frames to an integer multiple of $T_{w}$: $\hat{N}=\lceil\frac{N}{T_{w}}\rceil T_{w}$. Then, we divide each expanded video into two parts and adopt a different sampling strategy for each part.

In the first part, the clips are regular, with an interval of $I_{w}$ between consecutive frames. The index of the first frame in each video clip is $S=T_{w}I_{w}i+j$, where $i\in\{0,1,\cdots,K-1\}$, $j\in\{1,\cdots,I_{w}\}$, and $K=\lfloor\frac{\hat{N}}{T_{w}I_{w}}\rfloor$. We feed these regular clips sequentially into the model with window size $T_{w}$ and interval size $I_{w}$. The second part contains the remaining frames, whose number is not divisible by $T_{w}I_{w}$. The index of the first frame in this part is $T_{w}I_{w}K+1$, and it contains $\hat{N}-T_{w}I_{w}K$ frames. These frames are randomly divided into $\frac{\hat{N}}{T_{w}}-KI_{w}$ video clips, each of size $T_{w}$.
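The two-part sampling can be sketched in a few lines of Python. This is an illustrative reading of the description rather than the released code: the function name `sample_clips` is hypothetical, the offset $j$ is assumed to range over $1,\cdots,I_{w}$ so that every frame of a block is covered, and the padding policy (repeating the last frame) is an assumption, since the text only states that the video is expanded to $\hat{N}$ frames.

```python
import math
import random


def sample_clips(n_frames, t_w, i_w, seed=0):
    """Split 1-based frame indices into clips of size t_w."""
    n_hat = math.ceil(n_frames / t_w) * t_w
    # Assumed padding policy: repeat the last frame up to n_hat entries.
    frames = [min(f, n_frames) for f in range(1, n_hat + 1)]
    k = n_hat // (t_w * i_w)
    clips = []
    # Part 1: regular clips with stride i_w, starting at S = t_w * i_w * i + j.
    for i in range(k):
        for j in range(1, i_w + 1):
            start = t_w * i_w * i + j
            clips.append([frames[start - 1 + m * i_w] for m in range(t_w)])
    # Part 2: remaining frames, randomly divided into clips of size t_w.
    rest = frames[t_w * i_w * k:]
    random.Random(seed).shuffle(rest)
    for c in range(0, len(rest), t_w):
        clips.append(sorted(rest[c:c + t_w]))
    return clips


clips = sample_clips(n_frames=50, t_w=4, i_w=3)
print(len(clips), clips[0])  # 13 [1, 4, 7, 10]
```

With $N=50$, $T_{w}=4$ and $I_{w}=3$, the expanded length is $\hat{N}=52$, the first part yields $KI_{w}=12$ regular clips, and the last four frames form the single remaining clip, giving $\hat{N}/T_{w}=13$ clips in total.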

Besides, we introduce another sampling strategy based on random shuffling. We find that if we first randomly shuffle the expanded video $\hat{V}$ and then split it into $\frac{\hat{N}}{T_{w}}$ clips, our model can model the temporal motion better because each clip covers a larger view of the video. Empirical evidence from the human visual system suggests that when people are not certain about the identity of an object, they look for a distinct object in other frames that shares high semantic similarity with the current object and associate the two. Since Transformers are effective at modeling long-range dependencies, randomly shuffling the video increases the data diversity and lets the model fully utilize the global information of the video. The effectiveness of both strategies is demonstrated in Sec. 4.3.3.
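A minimal sketch of the shuffled variant is given below, assuming frames are addressed by 1-based indices up to $\hat{N}$; the function name `shuffled_clips` and the fixed seed are illustrative choices rather than part of the method.

```python
import random


def shuffled_clips(n_hat, t_w, seed=0):
    """Shuffle all frame indices, then cut them into n_hat / t_w clips."""
    order = list(range(1, n_hat + 1))
    random.Random(seed).shuffle(order)
    return [order[c:c + t_w] for c in range(0, n_hat, t_w)]


clips = shuffled_clips(n_hat=24, t_w=12)
```

Every frame still appears in exactly one clip, but each clip now mixes nearby and distant frames of the video.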

Experiment

Datasets: We conduct experiments on the ImageNet VID dataset, which is a large-scale benchmark for video object detection. It contains 3862 training videos and 555 validation videos with annotated bounding boxes of 30 classes. Since the ground truth of the official testing set is not publicly available, following common VOD protocols, we train our models using a combination of the ImageNet VID and DET datasets and measure the performance on the validation set using the mean average precision (mAP) metric.

Network architectures: In this work, we use Deformable DETR as the image detector, and the detector is pre-trained on the COCO dataset. Following the widely used implementation protocols in previous works, we use ResNet-50 and ResNet-101 as the network backbones. Besides, we also adopt Swin Transformer as the backbone for better performance, using the same hyper-parameters as with the ResNet backbones. Note that we do not use an FPN-like network for multi-scale features; instead, we fuse the multi-scale features by adding them into the largest scale. All these backbones are pre-trained on the ImageNet dataset. More implementation details can be found in our released code.

Training details: Following Deformable DETR, we use the AdamW optimizer with an initial learning rate of $2\times 10^{-4}$ for the Transformers and $2\times 10^{-5}$ for the backbone, and a weight decay of $10^{-4}$. All Transformer weights are initialized with Xavier initialization. The number of initial object queries is set to 300 for ResNet and 100 for Swin Transformer. During training, the batch size is 1, and the number of reference frames is 14 for TransVOD and TransVOD++ in all experiments. Following the sampling strategy in MMTracking, we adopt bilateral uniform sampling for reference frames, which means that reference images are randomly sampled from the nearby frames on both sides of the current frame. For TransVOD Lite, all frames of the video clip are sequentially fed into the model. In all experiments, we use the same data augmentation as MEGA, including random horizontal flipping and random resizing. We train the model for 14 epochs in an end-to-end manner. For better convergence, we first train the spatial Transformers for 7 epochs and then fine-tune the temporal Transformers for another 7 epochs. During fine-tuning, we freeze the parameters of the spatial Transformers and only optimize the temporal Transformers.
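The two learning rates and the two-stage schedule can be expressed as optimizer parameter groups. The sketch below is framework-agnostic and makes several assumptions: the helper name `build_param_groups` and the parameter-name prefixes `backbone.` and `temporal.` are hypothetical, and the returned list is meant to be passed to an optimizer such as `torch.optim.AdamW(groups, lr=2e-4, weight_decay=1e-4)`.

```python
def build_param_groups(named_params, lr=2e-4, lr_backbone=2e-5, temporal_only=False):
    """Group (name, param) pairs by learning rate.

    With temporal_only=True, only parameters whose (assumed) name prefix is
    'temporal.' are optimized, mimicking the fine-tuning stage. Leaving a
    parameter out of the optimizer stops its updates; a real setup would
    typically also disable its gradients to save computation.
    """
    backbone, transformer = [], []
    for name, param in named_params:
        if temporal_only and not name.startswith("temporal."):
            continue
        if name.startswith("backbone."):
            backbone.append(param)
        else:
            transformer.append(param)
    groups = [{"params": transformer, "lr": lr}]
    if backbone:
        groups.append({"params": backbone, "lr": lr_backbone})
    return groups


# Strings stand in for real parameter tensors in this toy example.
named = [("backbone.conv1.weight", "p0"),
         ("spatial.encoder.weight", "p1"),
         ("temporal.tqe.weight", "p2")]
stage1 = build_param_groups(named)
stage2 = build_param_groups(named, temporal_only=True)
```

In the first stage both groups are trained with their respective learning rates; in the second stage only the temporal parameters remain in the optimizer.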

Inference details: The inference runtime (FPS) in Table II is measured on a single V100 GPU card. During inference, the batch size is 1, and we sample the reference frames with a fixed frame stride for TransVOD and TransVOD++. As mentioned in Sec. 3.5, we sample the frames via random shuffling for TransVOD Lite. We use the same image resizing as MEGA, such that the shortest side of the image is at least 600 while the longest is at most 1000. Note that we do not need any sophisticated post-processing method, which largely simplifies the VOD pipeline.
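The resizing rule can be made concrete with a short sketch. It assumes the common detection convention in which the shorter side is scaled to 600 unless that would push the longer side beyond 1000, in which case the longer side is capped at 1000 instead; the function name `resized_shape` is hypothetical.

```python
def resized_shape(height, width, short_side=600, long_side=1000):
    """Return the (height, width) after aspect-preserving resizing."""
    scale = short_side / min(height, width)
    if scale * max(height, width) > long_side:
        # Cap the longer side; the shorter side then falls below short_side.
        scale = long_side / max(height, width)
    return int(height * scale + 0.5), int(width * scale + 0.5)


print(resized_shape(480, 640))   # (600, 800)
print(resized_shape(720, 1280))  # (563, 1000)
```

The second case shows that for wide frames the cap on the longer side takes precedence, so the shorter side ends up slightly below 600.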

4.2 Main Results

We first compare our proposed TransVOD and TransVOD++ using ResNet-50 backbone in Table I. Then we present the detailed results with the previous state-of-the-art VOD methods in Table III. Finally, we compare the real-time VOD models in Table II.

Results using ResNet-50 backbone. Table I compares our methods with state-of-the-art VOD methods using the ResNet-50 backbone. For a fair comparison, we also report the performance of existing VOD methods that use COCO-pretrained models. Although COCO-pretrained weights boost the mAP of existing VOD methods, our proposed TransVOD still outperforms the state-of-the-art methods by a large margin. In particular, TransVOD achieves 79.9 $\%$ with ResNet-50, which is a 1.3% $\sim$ 2.6 $\%$ absolute improvement over the best competitor, MEGA. Moreover, our proposed TransVOD++ further improves on the original TransVOD by 0.6 $\%$, achieving 80.5 $\%$ on the ImageNet VID validation set.

Results with stronger backbone. We further report results with stronger backbones and compare them with the state-of-the-art methods in Table III. When equipped with the stronger ResNet-101 backbone, the mAP of our TransVOD++ is further boosted to 82.0%, which outperforms most state-of-the-art VOD methods. Specifically, our model is remarkably better than FGFA (76.3 $\%$ mAP) and MANet (78.1 $\%$ mAP), which both aggregate features based on optical flow estimation, and the mAP improvements are +5.6 $\%$ mAP and +3.8 $\%$ mAP, respectively. When compared with relation-based methods (LRTRN (81.0 $\%$ mAP), RDN (81.8 $\%$ mAP), SELSA (80.3 $\%$ mAP)), our method also shows its superiority in terms of detection precision. Moreover, our proposed method boosts the strong baseline, i.e., Deformable DETR, by a significant margin (3% $\sim$ 4% mAP). After adopting Swin Base (SwinB) as the backbone, our TransVOD++ achieves 90.0 $\%$ mAP and outperforms previous works by a large margin (about 4% $\sim$ 5% mAP), which further demonstrates its effectiveness.

Results using TransVOD Lite. In Table II, we report the results of our TransVOD Lite and compare it with previous real-time VOD models. As shown in the table, with the ResNet-101 backbone, our method achieves the best speed-accuracy trade-off. After adopting Swin-Tiny as the backbone, our TransVOD Lite achieves 83.7 $\%$ mAP while running at nearly 30 FPS. Our best TransVOD Lite model, with a Swin Base backbone, achieves 90.1 $\%$ mAP while running at around 15 FPS. Furthermore, the parameter count (46.9M) is smaller than that of other video object detectors (e.g., around 100M), which also indicates that our method is friendlier to mobile devices.

4.3 Ablation Study and Analysis

Overview. In this section, we demonstrate the effect of key components in our proposed methods including TransVOD, TransVOD++ and TransVOD Lite. For TransVOD, we adopt ResNet-50 as the backbone. For TransVOD++ and TransVOD Lite, we adopt Swin Transformer as the backbone.

Effectiveness of each component in TransVOD. Table IV(a) summarizes the effects of different design components on the ImageNet VID dataset. The single-frame Deformable DETR baseline achieves 76.0 $\%$ and 88.3 $\%$ with ResNet-50 and Swin Base, respectively. By merely using TDTE and TDTD, we improve the baseline by +1.1 $\%$ and +0.5 $\%$ on the two backbones, respectively. By only adding TQE, we improve the baseline by +2.9 $\%$ and +1.0 $\%$ on the two backbones, respectively. The combination of TQE and TDTD increases the mAP to 79.3 $\%$ and 89.6 $\%$, respectively. Finally, the proposed TransVOD including all components achieves 79.9 $\%$ and 89.6 $\%$ with ResNet-50 and Swin Base, respectively. These improvements show the effect of each individual component of our TransVOD. Interestingly, we find that the effect of TDTE fades away when we use a stronger backbone, e.g., Swin Transformer.

Number of encoder layers in TDTE. Table V(a) shows the ablation study on the number of encoder layers in TDTE. We observe that using more than one TDTE encoder layer brings no significant benefit to the final performance. This experiment also supports the claim that aggregating the feature memories along the temporal dimension via deformable attention is useful for learning temporal contexts across different frames.

Number of encoder layers in TQE. Table V(b) shows the ablation study on the number of encoder layers in TQE. The best result occurs when the number of query encoder layers is set to 5. However, once the number of layers reaches 3, the performance is basically unchanged. Thus, we use 3 encoder layers in our final method.

Number of decoder layers in TDTD. Table V(c) illustrates the ablation study on the number of decoder layers in TDTD. The basic setting is 4 reference frames, 1 encoder layer in TQE, and 1 encoder layer in TDTE. The results indicate that only one decoder layer in TDTD is needed, and we set this number by default.

Number of top $k$ object queries in TQE. To verify the effectiveness of our coarse-to-fine Temporal Query Aggregation strategy, we conduct ablation experiments in Table V(d) and Table IV(b) to study how the number of selected queries contributes to the final performance. All the experiments in each table are conducted under the same setting. In the first experiment, we use 1 encoder layer in TQE with 4 reference frames, and the best performance is achieved when we choose the top 100 spatial object queries for each reference frame. The second experiment considers multiple TQE encoder layers, i.e., 3 encoder layers in TQE. Fine-to-fine (F2F) selection uses a small number of spatial object queries in each TQE encoder layer. Coarse-to-coarse (C2C) selection uses a large number of spatial object queries when performing the aggregation in each layer. Our proposed coarse-to-fine (C2F) aggregation strategy uses a larger number of spatial object queries in the shallow layers and a smaller number in the deep layers. The results in Table IV(b) show that our C2F aggregation strategy is superior to both C2C and F2F selection.

Number of reference frames in TransVOD. Table V(d) shows the ablation on the number of reference frames. The basic setting is 3 encoder layers in TQE, 1 encoder layer in TDTE, and 1 decoder layer in TDTD. As shown in Table V(d), the mAP improves as the number of reference frames increases, and it tends to stabilize once the number reaches 8. In all experiments, we set the number of reference frames to 14 for TransVOD with different backbones.

4.3.2 Ablation for TransVOD++

Effect of each component in TransVOD++ on strong baseline. In Table VI(a), we verify the effectiveness of each component in TransVOD++ on a strong baseline. Adding RoI and Query Fusion results in a 1.4 $\%$ mAP improvement, while applying Hard Query Mining leads to an extra 0.3 $\%$ mAP improvement overall and a 1.6 $\%$ mAP improvement on small objects. This suggests that our proposed Hard Query Mining is particularly helpful for detecting small objects.

Effect of reference frames in TransVOD++. In Fig. 6 (a), we show the effect of the number of reference frames in TransVOD++, where we find that the best number of reference frames is 14. This is different from the original TransVOD. We argue that utilizing more RoI information, rather than full-frame fusion, in the temporal dimension leads to better results. This finding is consistent with previous works focusing on RoI-wise fusion in the Faster-RCNN framework. We set the number of reference frames to 14 by default.

Improvements over different baselines. In Fig. 6 (b), we show the improvements over different single-frame baselines, including Swin Transformer and ResNet. Swin Base, Swin Small, and Swin Tiny are abbreviated as SwinB, SwinS, and SwinT, respectively. Our proposed TransVOD++ brings gains of 1.7% $\sim$ 4.2% mAP on various baselines. We observe that TransVOD++ outperforms TransVOD Lite under different backbones. In particular, with ResNet-50 and ResNet-101, the improvements of TransVOD++ (2.3% $\sim$ 4.2% mAP) are larger than those of TransVOD Lite (0.8% $\sim$ 1.6% mAP). Interestingly, we find that with Swin Transformer backbones, TransVOD Lite achieves almost the same performance as TransVOD++. This is mainly because the single-frame baselines with Swin Transformer are already very strong (88.3% mAP), so the improvements over them are less pronounced than those with ResNet.

Effect of multi-level feature fusion. Table VI(b) shows the improvements from multi-level feature fusion. In total, there is a 0.6 $\%$ mAP50 gain. Moreover, there is a more significant gain (2.3%) on mAP50:95, which indicates that multi-scale information leads to more accurate detection results. Thus, we adopt this simple multi-level feature fusion by default when using Swin Transformer as the backbone for both TransVOD++ and TransVOD Lite.

Effect of COCO pre-training using Swin Base. Following previous VOD methods, we pre-train our image detector on the COCO dataset. As shown in Table VI(c), removing COCO pre-training leads to a large performance drop. The main reason is that vision Transformers need more training samples for good convergence and are typically pre-trained on large-scale datasets. Thus, we pre-train the TransVOD series on the COCO dataset.

4.3.3 Ablation for TransVOD Lite

Effect of window size in TransVOD Lite. In Fig. 8 (a) and Fig. 8 (b), we show the effect of the window size on both accuracy and inference time, where the frames are randomly shuffled within the window in all experiments. As shown in these figures, increasing the window size improves both accuracy and FPS with both Swin Tiny and Swin Base backbones. Table VII(a) and Table VII(b) detail the results shown in these figures. We choose $T_{w}=15$ as the best window size for all models.

Effect of interval size and mode in TransVOD Lite. In Table VII(c), we show the effect of the interval size between frames in each fixed window. For different window sizes, increasing the interval size leads to better results, which indicates that fusing more global temporal information is beneficial. However, our proposed random shuffling strategy gives the best performance across window sizes. This is mainly because random shuffling increases the diversity of the frames in each window; for example, both global and local temporal information can exist in one window. Moreover, during training, the frames are randomly selected from each clip, so randomly shuffled inputs share the same distribution as the training examples. We report the final performance using this setting. In addition, as shown in Table VII(c), even with sequential inputs, our method still achieves the best performance compared with the methods in Table II.

Ablation on query numbers in Sequential Hard Query Mining. In Table VII(d), we perform ablation studies on Sequential Hard Query Mining (SeqHQM) in TransVOD Lite. From the table, we find that the best setting keeps 80, 50, and 30 queries in the three successive stages. We use this setting for all TransVOD Lite models.

4.4 Visualization and Analysis

Visual detection results. In Fig. 10, we show the visual detection results of the still-image detector, i.e., Deformable DETR, and our proposed TransVOD in odd and even rows, respectively. The still-image detector is prone to false detections (e.g., a turtle detected as a lizard) and missed detections (e.g., a zebra not detected) in cases of motion blur and partial occlusion. Compared with Deformable DETR, our method effectively models the long-range dependencies across different video frames to enhance the features of the detected image. Thus, our TransVOD not only increases the confidence of correct predictions but also effectively reduces the number of missed and false detections. Besides, as shown in Fig. 10 (b), our TransVOD Lite produces more confident scores than the single-frame baseline.

Visual sampling locations of object query in TransVOD. To further explore the advantages of TQE, we visualize the sampling locations of both the spatial object query and the temporal object query in Fig. 7. The sampling locations indicate the most relevant context for each detection. As shown in the figure, for each frame in each clip, our temporal object query yields more concentrated and precise sampling locations on foreground objects, while the original spatial object query yields more diffuse ones. This suggests that our temporal object query is better suited to detecting objects in video and helps explain the effectiveness of our temporal query fusion.

Failure case analysis. In Fig. 9, we present several failure cases of our best TransVOD Lite model. The first two rows show missed detections. The failure in the first row is mainly due to large motion blur, and the one in the second row is caused by background changes. The last two rows show false detections in which a car is detected as a bus, which is caused by large occlusion. Both kinds of cases show that better handling of occlusion and more stable temporal modeling are needed in future work.

Conclusion

In this paper, we proposed a novel video object detection framework, namely TransVOD, which provides a new perspective on feature aggregation by leveraging spatial-temporal Transformers. TransVOD effectively removes the need for many hand-crafted components and complicated post-processing methods. Our core idea is to aggregate both the spatial object queries and the memory encodings in each frame via temporal Transformers. Our TransVOD boosts the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset across various baselines. To our knowledge, our work is the first to apply Transformers to VOD. Based on the TransVOD framework, we present two advanced versions, namely TransVOD++ and TransVOD Lite. The former improves the performance of TransVOD via better Query and RoI Fusion (QRF) and Hard Query Mining (HQM), which fully utilize the object-level information and dynamically reduce the number of object queries and targets. The latter focuses on real-time video object detection by modeling VOD as a sequence-to-sequence prediction problem via Sequential Hard Query Mining (SeqHQM). Both models set new state-of-the-art results on the ImageNet VID dataset in two different settings: accuracy for non-real-time models and the best speed-accuracy trade-off for real-time models. Our method is the first work that achieves 90 $\%$ mAP on the ImageNet VID dataset. Moreover, TransVOD Lite is the first work that achieves 83.7% mAP while running in real time. We believe our models can serve as new baselines for this area.