Language as Queries for Referring Video Object Segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo

Introduction

Referring video object segmentation (R-VOS) aims to segment the target object in a video given a natural language description. This emerging topic has attracted great attention in the research community and is expected to benefit many applications in a friendly and interactive way, e.g., video editing and video surveillance. R-VOS is more challenging than traditional semi-supervised video object segmentation, because it not only lacks the ground-truth mask annotation in the first frame, but also requires a comprehensive understanding of the cross-modal sources, i.e., vision and language. Therefore, the model should have a strong ability to infer which object is referred to and to perform accurate segmentation.

To accomplish this task, the existing methods can be mainly categorized into two groups: (1) Bottom-up methods. These methods incorporate the vision and language features in an early-fusion manner, and then adopt an FCN as the decoder to generate object masks, as shown in Figure 1(a). (2) Top-down methods. These methods tackle the problem from a top-down perspective and follow a two-stage pipeline. As illustrated in Figure 1(b), they first employ an instance segmentation model to find all the objects in each frame, and then associate them across the entire video to form the tracklet candidates. Afterwards, they use the expression as the grounding criterion to select the best-matched one.

Although these two streams of methods have demonstrated their effectiveness with promising results, they still have some intrinsic limitations. First, the bottom-up methods fail to capture the crucial instance-level information and do not consider the object association across multiple frames. Therefore, this type of method cannot provide explicit knowledge for cross-modal reasoning and may predict inconsistent objects when the scene changes. Second, although top-down methods have greatly boosted the performance over the bottom-up methods, they suffer from a heavy workload because of the complex, multi-stage pipeline. For example, the recent method proposed by Liang et al. comprises three parts: HTC , CFBI and a tracklet-language grounding model. All these networks need to be pretrained on ImageNet , COCO or RefCOCO and further finetuned on R-VOS datasets, respectively. Furthermore, the separate optimization of several sub-problems leads to a sub-optimal solution.

These limitations of current methods motivate us to design a simple and unified framework that solves the R-VOS task elegantly. The recent success of Transformer in object detection and video instance segmentation demonstrates a promising solution. However, it is non-trivial to apply such models to the R-VOS task. These models use a fixed number (e.g., 100) of learnable queries to detect all the objects in an image. Under this circumstance, it is hard for the model to distinguish which object is referred to, since the expression can describe any of them. This raises a natural question: "Is it possible for a unified model to know where to look using queries?"

This work answers the question by proposing the notion of language as queries, as shown in Figure 1(c). We put the linguistic restriction on all object queries and use these conditional queries as input for the model. In this manner, the expression makes the queries focus on the referred object only, which greatly reduces the query number (e.g., 5 in our experiments). The next challenge lies in how to decode the object mask from query representations. As the queries contain rich instance characteristics, we view them as instance-aware dynamic kernels to filter out the segmentation masks from feature maps. Moreover, to make the feature maps more discriminative, we design a novel cross-modal feature pyramid network (CM-FPN) where the visual and linguistic features interact at multiple levels for fine-grained cross-modal fusion.

The unified framework produces not only the segmentation masks for referred objects, but also the classification results and detection boxes. Moreover, the conditional queries are linked via an instance matching strategy across frames so that object tracking is achieved naturally without post-processing. As shown in Figure 5, our unified framework is able to detect, segment and track the referred object simultaneously. We hope this framework could serve as a strong baseline for the R-VOS task.

The main contributions of this work are as follows.

We propose a simple and unified framework for referring video object segmentation, termed ReferFormer. Given a video clip and the corresponding language expression, our framework directly detects, segments and tracks the referred object in all frames in an end-to-end manner.

We present the notion of language as queries. We introduce a small set of object queries which are conditioned on the text expression to attend to the referred object only. These conditional queries are shared across different frames in the initial state and they are transformed into dynamic kernels to filter out the segmentation masks from feature maps. This mechanism provides a new perspective for the R-VOS task.

We design the cross-modal feature pyramid network (CM-FPN) for multi-scale vision-language fusion, which improves the discriminativeness of mask features for accurate segmentation.

Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show that ReferFormer outperforms the previous methods on these four benchmarks by a large margin. For example, on Ref-Youtube-VOS, ReferFormer with a ResNet-50 backbone achieves 55.6 $\mathcal{J}\&\mathcal{F}$ without bells and whistles, a gain of 8.4 points over the previous state-of-the-art methods. Using the strong Video-Swin-Base visual backbone, ReferFormer achieves 64.9 $\mathcal{J}\&\mathcal{F}$ .

Related Work

Semi-supervised Video Object Segmentation. The traditional semi-supervised video object segmentation (Semi-VOS) aims to propagate the ground-truth object masks given in the first frame to the entire video. Most recent works belong to the group of matching-based methods, which perform feature matching to track the target objects. STM leverages a memory to store the past object features and utilizes the attention matching mechanism on the memory to guide the prediction of the current frame. CFBI considers the embedding learning of not only the foreground objects but also the background, resulting in a more robust framework.

Referring Video Object Segmentation. Referring video object segmentation (R-VOS) provides the language description instead of the mask annotation as the object reference, which makes it a more challenging task. The current methods for R-VOS mainly follow two pipelines: (1) Bottom-up methods. An intuitive approach is to directly apply image-level methods to the video frames independently, e.g., RefVOS . The obvious drawback of such methods is that they fail to utilize the valuable temporal information across frames, resulting in inconsistent object prediction due to scene or appearance variations. To address this issue, URVOS casts the task as a joint problem of referring object segmentation in an image and mask propagation in a video. They propose a unified referring VOS framework that employs a memory attention module to leverage the information of mask predictions in previous frames. (2) Top-down methods. The typical top-down method first constructs an exhaustive set of object tracklets by propagating the object masks detected from several key frames to the whole video. Then, a language grounding model is built to select the best object tracklet from the candidate set. Although this method has greatly improved performance over the previous methods, the complex, multi-stage pipeline is computationally expensive and impractical.

In contrast to these two pipelines, we propose a query-based method that achieves the strongest performance with a simple and unified framework. The very recent work MTTR also relies on the query-based mechanism. Nevertheless, it needs exhaustive segmentation annotations of all objects and supervises the un-referred instances during training, which increases the annotation workload and limits the framework in practical applications.

Transformer. Transformer was first introduced for sequence-to-sequence translation in the natural language processing (NLP) community and has achieved marvelous success in most computer vision tasks such as object detection, tracking and segmentation. DETR introduces the new query-based paradigm for object detection, which employs a set of object queries as candidates and inputs them to the Transformer decoder. Beyond the image field, VisTR extends the framework to the video instance segmentation (VIS) task and solves the problem in a direct end-to-end parallel sequence decoding manner. SeqFormer decouples the content query and box query to aggregate temporal information from each frame and achieves state-of-the-art performance on the VIS task. Inspired by these works, our work also relies on the query-based mechanism of Transformer but considers an additional modality, i.e., language, as the object reference. Thus, we propose the notion of language as queries and build a simple and unified framework that detects, segments and tracks the referred object simultaneously.

Approach

Visual Encoder. We start by adopting a visual backbone to extract the multi-scale feature maps for each frame in the video clip independently, resulting in the visual feature sequence $\mathcal{F}_{v}=\left\{f_{t}\right\}_{t=1}^{T}$ . It is noteworthy that both a 2D spatial encoder (e.g., ResNet ) and a 3D spatio-temporal encoder (e.g., Video Swin Transformer ) could play the role of the visual backbone.

3.2 Language as Queries

The key design is that we use a set of object queries conditioned on the language expression, termed conditional queries, as the Transformer decoder input. These queries are obligated to focus on the referred object only and produce the instance-aware dynamic kernels. The final segmentation masks are obtained by performing dynamic convolution between the dynamic kernels and their corresponding feature maps. Here, we adopt Deformable-DETR as our Transformer model due to its effectiveness and efficiency in capturing the global pixel-level relations.

Transformer Encoder. First, a 1 $\times$ 1 convolution is applied to the multi-scale visual features $\mathcal{F}_{v}$ to reduce the channel dimension of all feature maps to $C=256$ . To enrich the information of visual features, we then fuse the projected visual features with the text feature $\mathcal{F}_{e}$ by multiplication and form the new multi-scale feature maps $\mathcal{F}_{v}^{{}^{\prime}}=\left\{f_{t}^{{}^{\prime}}\right\}_{t=1}^{T}$ . Afterwards, the fixed 2D positional encoding is added to the feature maps of each frame and the summed features are fed into the Transformer encoder. So that the Transformer processes the video frames independently, we flatten the spatial dimensions and move the temporal dimension to the batch dimension for efficiency. Finally, the output of the Transformer encoder, i.e., the encoded memory, is input to the decoder.

Transformer Decoder. Similar to prior work, we introduce $N$ object queries to represent the instances for each frame; the difference is that the query weights are shared across video frames. This mechanism is more flexible in handling videos of variable length and makes it more robust for the queries to track the same instances. Meanwhile, we repeat the sentence feature $f_{e}^{s}$ $N$ times to fit the query number. Both the object queries and the repeated sentence features are fed into the decoder as input. In this manner, all the queries use the language expression as guidance and try to find the referred objects only. These conditional queries are duplicated to serve as the decoder input for all the frames and they are eventually turned into instance embeddings by the decoder, resulting in a set of $N_{q}=T\times N$ predictions. It should be noted that the queries keep the same order across different frames, and following prior work we refer to the queries in the same relative position (represented as the same shape in Figure 2) as an instance sequence. Therefore, the temporal coherence of the referred object can be achieved easily by linking the corresponding queries.
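The bookkeeping behind shared conditional queries can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the released implementation: the names `ConditionalQuery`, `build_conditional_queries` and `group_into_instance_sequences` are hypothetical, and embeddings are represented as plain lists of floats rather than tensors.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ConditionalQuery:
    frame: int        # frame index t in [0, T)
    slot: int         # query index i in [0, N), identical across frames
    position: Vector  # learnable query embedding, shared by all frames
    content: Vector   # sentence feature, repeated for every query


def build_conditional_queries(query_embed: List[Vector],
                              sentence_feature: Vector,
                              num_frames: int) -> List[ConditionalQuery]:
    """Duplicate the N shared queries for each of the T frames (T x N in total)."""
    queries = []
    for t in range(num_frames):
        for i, embed in enumerate(query_embed):
            queries.append(ConditionalQuery(t, i, embed, sentence_feature))
    return queries


def group_into_instance_sequences(queries: List[ConditionalQuery],
                                  num_queries: int) -> List[List[ConditionalQuery]]:
    """Link queries that share the same slot across frames into one sequence."""
    sequences: List[List[ConditionalQuery]] = [[] for _ in range(num_queries)]
    for query in queries:
        sequences[query.slot].append(query)
    return sequences
```

Because the slot index is preserved across frames, grouping by slot is all that is needed to obtain the $N$ instance sequences; no separate association step appears in this sketch.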

Prediction Heads. Three lightweight heads are built on top of the decoder to further transform the $N_{q}$ instance embeddings. The class head outputs the binary probability which indicates whether the instance is referred to by the text sentence and is visible in the current frame. It could also be modified to predict the referred object category by simply changing the output class number. The mask head is implemented by three consecutive linear layers. It produces the parameters of $N_{q}$ dynamic kernels $\Omega=\left\{\omega_{i}\right\}_{i=1}^{N_{q}}$ , which are similar to the conditional convolutional filters in prior work. These parameters are reshaped to form three $1\times 1$ convolution layers with a channel number of 8. The box head is a 3-layer feed forward network (FFN) with ReLU activation except for the last layer. It predicts the box location of the referred object, so the position of each dynamic kernel can be determined by the center of the corresponding box.
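Since the position of each dynamic kernel is tied to the center of its predicted box, a per-kernel relative coordinate map can be derived from that center. The sketch below shows one plausible way to do this; it assumes box centers are normalized to $[0,1]$ and that pixel centers sit at half-integer offsets, neither of which is stated in the text. The function names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one channel, indexed as [row][column]


def relative_coords(height: int, width: int,
                    center_x: float, center_y: float) -> Tuple[Grid, Grid]:
    """Return (dx, dy) maps: offset of every pixel from the kernel position.

    Assumption: the box center is given in normalized [0, 1] coordinates.
    """
    dx = [[(x + 0.5) / width - center_x for x in range(width)]
          for _ in range(height)]
    dy = [[(y + 0.5) / height - center_y for _ in range(width)]
          for y in range(height)]
    return dx, dy


def concat_coords(features: List[Grid],
                  center_x: float, center_y: float) -> List[Grid]:
    """Append the two relative-coordinate channels to a [C][H][W] feature map."""
    height, width = len(features[0]), len(features[0][0])
    dx, dy = relative_coords(height, width, center_x, center_y)
    return features + [dx, dy]
```

With an 8-channel mask feature map this yields 10 input channels per kernel, which is the input width assumed by the first dynamic layer in such a design.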

Dynamic Convolution. Suppose now we have obtained the semantically-rich feature maps $\mathcal{F}_{\text{seg}}=\left\{f_{seg}^{t}\right\}_{t=1}^{T}$ (will be discussed in Sec. 3.3) for each frame, the question is how we perform the instance sequence segmentation and obtain the masks of referred object from them. Since the dynamic kernels have captured the object-level information, we use them as convolution filters on the feature maps for instance decoding. Considering that the location prior of dynamic kernels $\Omega$ provides a strong and robust reference for the referred object, we concatenate the feature maps $\mathcal{F}_{seg}$ with relative coordinates for each dynamic kernel. Finally, the binary segmentation masks are generated by performing dynamic convolution between the conditional convolutional weights and their corresponding feature maps:

Illustration of conditional queries. It is well known that the decoder embedding and position embedding in the Transformer decoder encode the content and spatial information respectively. In our framework, these two parts are fed with the text sentence feature and the learnable query parameters, so that all the queries are restricted by the language expression. As shown in Figure 3, these queries focus on the referred object only, even if other objects exist in the video. One query receives a much higher score while the scores of the other queries are suppressed.

3.3 Cross-modal Feature Pyramid Network

Feature pyramid network (FPN) is adopted to produce multi-scale feature maps for video frames. We construct a 4-level pyramid with the spatial stride from 4 $\times$ to 32 $\times$ . Specifically, the first three stage features of Transformer encoded memory (with spatial strides $\left\{8,16,32\right\}$ ) and the 4 $\times$ feature from visual backbone are stacked to form the hierarchical features. Although the standard FPN can already provide a high-resolution feature map with rich visual semantics, such feature map lacks the linguistic information and would be sub-optimal for the cross-modal task. The previous work only incorporates the language feature on the top level of FPN, which is a coarse fusion fashion. Here, we design a cross-modal feature pyramid network (CM-FPN) to perform multi-scale cross-modal fusion for finer interaction, as shown in Figure 4.

3.4 Instance Sequence Matching and Loss

Using $N$ conditional queries, we generate the set of $N_{q}=T\times N$ predictions, which can be regarded as the trajectories of $N$ instances over $T$ frames. As described previously, the predictions across frames maintain the same relative positions. Therefore, we can supervise the instance sequence as a whole using an instance matching strategy . Let us denote the prediction set as $\hat{y}=\left\{\hat{y}_{i}\right\}_{i=1}^{N}$ , and the predictions for the $i$ -th instance are represented by:

Since there is only one referred object in the video, the ground-truth instance sequence is represented as $y=\left\{c^{t},b^{t},s^{t}\right\}_{t=1}^{T}$ . $c^{t}$ is a binary value that equals 1 when the ground-truth instance is visible in the frame $I_{t}$ and 0 otherwise. To train the network, we first find the best prediction as the positive sample by minimizing the matching cost:

The matching cost is computed from each frame and normalized by the frame number. Here, $\mathcal{L}_{cls}(y,\hat{y}_{i})$ is the focal loss that supervises the predicted instance sequence reference results. The box-related loss sums up the L1 loss and the GIoU loss . The mask-related loss is the combination of the DICE loss and the binary mask focal loss. Both mask losses are calculated spatio-temporally over the entire video clip. The network is optimized by minimizing the total loss $\mathcal{L}_{match}$ for positive samples while letting the negative samples predict the $\varnothing$ class.
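The selection of the positive sample can be illustrated with a small sketch. It assumes that the individual loss terms have already been computed per frame for every instance sequence, and it reuses the loss coefficients reported in the implementation details. Averaging every term per frame is a simplification made here for illustration; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Coefficients as reported in the training details of the paper.
LOSS_WEIGHTS: Dict[str, float] = {
    "cls": 2.0, "l1": 5.0, "giou": 2.0, "dice": 5.0, "focal": 2.0,
}

FrameCosts = Dict[str, float]  # one value per loss term, for a single frame


def sequence_cost(per_frame: List[FrameCosts]) -> float:
    """Weighted cost of one instance sequence, normalized by the frame number."""
    total = 0.0
    for frame in per_frame:
        total += sum(weight * frame[name] for name, weight in LOSS_WEIGHTS.items())
    return total / len(per_frame)


def match_positive(all_sequences: List[List[FrameCosts]]) -> Tuple[int, List[float]]:
    """Return the index of the lowest-cost sequence and every sequence's cost."""
    costs = [sequence_cost(seq) for seq in all_sequences]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs
```

Only the matched sequence would then be supervised with the full loss; the remaining sequences act as negatives for the classification term.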

3.5 Inference

As mentioned previously, ReferFormer can handle videos of arbitrary length in a single forward pass since all the frames share the same initial conditional queries. Given the video and language expression, ReferFormer predicts $N$ instance sequences. For each instance query, we average the predicted reference probabilities over all the frames and obtain the reference score set $\mathcal{P}=\left\{p_{i}\right\}_{i=1}^{N}$ . We select the instance sequence with the highest average score and its index is denoted as $\sigma$ : $\sigma=\arg\max_{i\in\left\{1,\dots,N\right\}}p_{i}$ .

The final segmentation masks for each frame $\mathcal{S}=\left\{s_{t}\right\}_{t=1}^{T}$ are obtained from the mask candidate set $\hat{\mathcal{S}}$ by selecting the corresponding queries indexed with $\sigma$ . No post-processing is needed for associating objects since the linked queries naturally track the same instance.
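The inference rule described above amounts to an average followed by an argmax. A minimal sketch, assuming per-frame scores and mask candidates are stored as nested lists indexed by frame and then by query (the function name is hypothetical):

```python
from typing import List, Sequence, Tuple, TypeVar

MaskT = TypeVar("MaskT")


def select_referred_sequence(
    scores: Sequence[Sequence[float]],   # scores[t][i]: reference probability
    masks: Sequence[Sequence[MaskT]],    # masks[t][i]: mask candidate
) -> Tuple[int, List[float], List[MaskT]]:
    """Pick the query with the highest frame-averaged score and its masks."""
    num_frames = len(scores)
    num_queries = len(scores[0])
    averaged = [
        sum(scores[t][i] for t in range(num_frames)) / num_frames
        for i in range(num_queries)
    ]
    sigma = max(range(num_queries), key=averaged.__getitem__)
    return sigma, averaged, [masks[t][sigma] for t in range(num_frames)]
```

Because the same query index is used in every frame, the returned list of masks is already a temporally consistent track.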

Experiments

Datasets. The experiments are conducted on the four popular R-VOS benchmarks: Ref-Youtube-VOS , Ref-DAVIS17 , A2D-Sentences and JHMDB-Sentences . Ref-Youtube-VOS is a large-scale benchmark which covers 3,978 videos with $\sim$ 15K language descriptions. Ref-DAVIS17 is built upon DAVIS17 by providing the language description for a specific object in each video and contains 90 videos. A2D-Sentences and JHMDB-Sentences are created by providing the additional textual annotations on the original A2D and JHMDB datasets. A2D-Sentences contains 3,782 videos and each video has 3-5 frames annotated with the pixel-level segmentation masks. JHMDB-Sentences has 928 videos with the 928 corresponding sentences in total.

Evaluation Metrics. We use the standard evaluation metrics for Ref-Youtube-VOS and Ref-DAVIS17: region similarity ( $\mathcal{J}$ ), contour accuracy ( $\mathcal{F}$ ) and their average value ( $\mathcal{J}\&\mathcal{F}$ ). For Ref-Youtube-VOS, as the annotations of validation set are not released publicly, we evaluate our method on the official challenge server https://competitions.codalab.org/competitions/29139. Ref-DAVIS17 is evaluated by the official evaluation code https://github.com/davisvideochallenge/davis2017-evaluation.

On A2D-Sentences and JHMDB-Sentences, the model is evaluated with the criteria of Precision@K, Overall IoU, Mean IoU and mAP over 0.50:0.05:0.95. Precision@K measures the percentage of test samples whose IoU scores are higher than the threshold K. Following the standard protocol, the thresholds are set as 0.5:0.1:0.9.
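For concreteness, the IoU-based criteria can be sketched as follows, using the common definitions: Overall IoU divides the total intersection by the total union over all samples, and Mean IoU averages the per-sample IoU. Masks are flattened lists of 0/1 values. Treating an empty prediction against an empty ground truth as IoU 1.0 is an assumption of this sketch, and mAP is omitted.

```python
from typing import Dict, List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def evaluate(samples: List[Tuple[Mask, Mask]],
             thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9)
             ) -> Dict[str, float]:
    """Compute Overall IoU, Mean IoU and Precision@K over (pred, gt) pairs."""
    ious: List[float] = []
    total_inter = total_union = 0
    for pred, gt in samples:
        inter, union = intersection_union(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    results = {
        "overall_iou": total_inter / total_union if total_union else 1.0,
        "mean_iou": sum(ious) / len(ious),
    }
    for k in thresholds:
        results[f"P@{k}"] = sum(iou > k for iou in ious) / len(ious)
    return results
```

Overall IoU favors large objects because every pixel counts equally, while Mean IoU weights every sample equally, which is why the two are usually reported together.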

4.2 Implementation Details

Model Settings. We test our models under different visual backbones including: ResNet , Swin Transformer and Video Swin Transformer . The text encoder is RoBERTa and its parameters are frozen during the entire training stage. Following prior work, we use the features from the last three stages of the visual backbone as the input to the Transformer; their corresponding spatial strides are $\left\{8,16,32\right\}$ . In the Transformer model, we adopt 4 encoder layers and 4 decoder layers and the hidden dimension is $C=256$ . The number of conditional queries is set to 5 unless otherwise specified.

Training Details. During training, we use sliding windows to obtain the clips from a video and each clip consists of 5 randomly sampled frames. Following prior work, the data augmentation includes random horizontal flip, random resize, random crop and photometric distortion. All frames are downsampled so that the short side has a size of 360 and the maximum size for the long side is 640 to fit GPU memory. The coefficients for losses are set as $\lambda_{cls}=2$ , $\lambda_{L1}=5$ , $\lambda_{giou}=2$ , $\lambda_{dice}=5$ , $\lambda_{focal}=2$ .
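The resizing rule can be made concrete with a small helper. This is a sketch of one common way to combine a short-side target with a long-side cap (scale to the short side first, then shrink further if the long side exceeds the cap); the exact procedure and rounding used in the original code are not specified in the text, and `resize_shape` is a hypothetical name.

```python
from typing import Tuple


def resize_shape(height: int, width: int,
                 short_side: int = 360,
                 max_long_side: int = 640) -> Tuple[int, int]:
    """Return the (height, width) after aspect-preserving downsampling."""
    scale = short_side / min(height, width)
    # Assumption: if the long side would exceed the cap, the cap takes priority.
    if max(height, width) * scale > max_long_side:
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale)
```

For a 720 x 1280 frame both constraints are met exactly at 360 x 640, whereas a wider 480 x 1280 frame is limited by the long-side cap and ends up at 240 x 640 under this rule.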

Most of our experiments follow the pretrain-then-finetune process, and some models are trained from scratch for fair comparison. Additionally, on Ref-Youtube-VOS, we also report the results of training on the mixed data from Ref-Youtube-VOS and Ref-COCO . The joint training technique has proven effective in many VIS tasks . Please see more in the supplementary materials.

Inference Details. During inference, the video frames are downscaled to 360p. We directly output the predicted segmentation masks without post-process. On Ref-Youtube-VOS, we further use a simple post-process technique to refine the object masks. Concretely, we first select a frame with the highest prediction score as the reference frame. Then, we apply the off-the-shelf mask propagation method CFBI to propagate the predicted mask of this frame forward and backward to the entire video. The results with post-process are shown in Table 5.
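The optional post-process only needs a reference frame and two propagation orders. The sketch below covers that scheduling step alone; the mask propagation model itself (CFBI in the paper) is not reproduced, and `propagation_plan` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def propagation_plan(frame_scores: Sequence[float]) -> Tuple[int, List[int], List[int]]:
    """Choose the highest-scoring frame and the frame orders to propagate over.

    Returns (reference index, forward order, backward order).
    """
    reference = max(range(len(frame_scores)), key=frame_scores.__getitem__)
    forward = list(range(reference + 1, len(frame_scores)))
    backward = list(range(reference - 1, -1, -1))
    return reference, forward, backward
```

The predicted mask of the reference frame would then be handed to the propagation model, which visits the frames in the forward order and in the backward order starting from that frame.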

4.3 Main Results

Ref-Youtube-VOS & Ref-DAVIS17. We compare our method with other state-of-the-art methods in Table 1. CITD and PMINet are the top-2 solutions in the 2021 Ref-Youtube-VOS Challenge. Their ensemble results are based on building 5 and 4 models, respectively. It can be observed that ReferFormer outperforms previous methods on the two datasets under all metrics and by a large margin. On Ref-Youtube-VOS, ReferFormer with a ResNet-50 backbone achieves an overall $\mathcal{J}\&\mathcal{F}$ of 55.6, which is 8.4 points higher than the previous state-of-the-art work URVOS , and even beats PMINet using ensemble models and post-processing (55.6 vs 54.2). Using the strong Swin-Large backbone, ReferFormer reaches 62.4 $\mathcal{J}\&\mathcal{F}$ without bells and whistles, which clearly exceeds the ensemble results of the complicated, multi-stage method CITD . By using the joint training process, the performance of our model can be further boosted to 64.2 $\mathcal{J}\&\mathcal{F}$ , setting a new record. Additionally, we also test the Video Swin Transformer as the backbone. It is well known that a spatio-temporal visual encoder has a strong ability to capture both the spatial characteristics and the temporal cues. For a fair comparison with MTTR , we train our model with the Video-Swin-Tiny backbone from scratch. It can be seen that our method outperforms MTTR under all the metrics with a smaller window size (5 vs 12). Comparing the results of ReferFormer under the Video-Swin-Tiny backbone shows that the model benefits from the pretraining stage and the joint training process, which help address the overfitting issue.

On Ref-DAVIS17, our method also achieves the best results under the same ResNet-50 setting (58.5 $\mathcal{J}\&\mathcal{F}$ ). And the performance consistently improves by using stronger backbones, which proves the generality of our method.

A2D-Sentences & JHMDB-Sentences. We further evaluate our method on the A2D-Sentences dataset and compare the performance with other state-of-the-art methods in Table 2. ClawCraneNet is a multi-stage method which uses an off-the-shelf instance segmentation model (with a ResNet-101 backbone) to provide the mask candidates. Table 2 shows that our method achieves a clear improvement over the previous methods. Compared with the recent MTTR , our method exhibits a clear performance advantage ( $+$ 2.5 mAP) with a smaller window size (6 vs. 10). Incorporating the pretraining stage, ReferFormer with the Video-Swin-Base visual backbone achieves 55.0 mAP, a gain of 8.9 mAP over the previous best result. ReferFormer also demonstrates its strong ability to produce high-quality masks under the stringent metrics (e.g., 57.9 for P@0.8 and 21.2 for P@0.9).

We also evaluate the models on JHMDB-Sentences without finetuning to further prove the generality of our method. As shown in Table 3, ReferFormer significantly outperforms all the existing methods. It is noticeable that all the methods produce low scores on P@0.9. A possible reason is that the ground-truth masks are generated from human puppets, leading to the inaccurate mask annotations.

4.4 Ablation Study

In this section, we perform extensive ablation studies on Ref-Youtube-VOS to study the effect of the core components in our model. All models are based on the Video-Swin-Tiny visual backbone and we train the models from scratch unless otherwise specified. The detailed analysis is as follows.

First, from the first row of Table 4, the baseline method only achieves 47.2 $\mathcal{J}$ and 50.1 $\mathcal{F}$ . This inferior behavior can be attributed to two reasons: (1) The baseline method cannot distinguish similar objects that are close together and tends to segment the most salient region. In contrast, our method performs well with only 1 conditional query (see Table 6(a)), proving that dynamic convolution is essential for segmenting the referred object. (2) Our method uses a set of shared queries to track instances in all frames, and the best query is determined by the voting scores of each frame. In this sense, our model can produce a reliable reasoning result and keep the temporal consistency over the entire video. In contrast, the baseline method can be regarded as an image-level method that independently predicts the results of each frame even though the model is able to aggregate the information from other frames.

Second, comparing the second and last rows of Table 4, we can see that the standard FPN already achieves strong performance and the vision-language fusion process further helps to provide more accurate segmentation. This is because the object mask can be inaccurate due to light variation, whereas the cross-modal fusion uses the text as a complement to strengthen the object pixel features and thus facilitates the segmentation prediction. Another technique is concatenating the relative coordinates of dynamic kernels with the mask features. This helps the model better determine the location of the referred object and leads to a performance improvement, as shown in the third row of Table 4.

Visual Backbone. We implement different visual backbones and report the results in Table 5. As expected, the performance of our model consistently increases by using stronger backbones. And the CFBI post-process can help to further boost the performance under all backbone settings. Interestingly, we observe that the performance improvement by post-process tends to narrow when the backbone gets stronger, e.g., +3.8 for ResNet-50 and +0.9 for Swin-Large when considering the $\mathcal{J}\&\mathcal{F}$ metric. This phenomenon shows that the visual encoder is essential for providing reliable reasoning on which object is described and generating the precise masks.

Number of Conditional Queries. Benefiting from the design of conditional queries, all the initial object queries tend to find the referred object only. In this situation, a relatively small number of queries is sufficient. In Table 6(a), we study the effect of the query number for each frame. It can be seen that the model achieves considerable results under all these settings, even with $N=1$ . Certainly, more queries enable the model to make its judgement from a wider range of instance candidates, which better handles complicated scenes where similar objects are clustered together. The performance saturates at $N=5$ and begins to slightly decrease when the query number gets larger. We conjecture that this is caused by the imbalance of label assignment as there is only one positive sample in each frame.

Number of Training Clip Frames. We study the effect of the training clip frame number in Table 6(b). Note that under $T=1$ , the model can be viewed as an image-level method and the $\mathcal{J}\&\mathcal{F}$ performance is only 50.0. When the frame number increases to 3, the model enjoys a significant $\mathcal{J}\&\mathcal{F}$ gain of 4.8. This is because using more frames to form a clip helps the model better aggregate the temporal action-related information. We choose $T=5$ by default.

Label Assignment Method. Our framework is able to predict the reference probability, box location and segmentation mask of the referred object. We find the optimal positive sample by minimizing the overall matching cost in Eq.4. There are some variants in the label assignment method and we carry out the comparison experiments in Table 6(c). From the first two rows in Table 6(c) we show that the lack of box or mask cost would both lead to the performance drop. With the segmentation-centric design, the mask cost is the most direct guidance for optimization, and the box provides the location prior for dynamic kernel. Thus, the combination of classification, box and mask cost shows more robustness.

4.5 Visualization Results

We show the visualization results of our model in Figure 5. It can be seen that ReferFormer is able to segment and track the referred object in challenging cases, e.g., person pose variations, instance occlusion and instances that are partially visible or have completely disappeared from the camera view.

Conclusion

In this work, we propose ReferFormer, an extremely simple and unified framework for referring video object segmentation. This framework provides a new perspective for the R-VOS task which views the language as queries. These queries are restricted to attend to the referred object only, and the object tracking is easily achieved by linking the corresponding queries. Given the video clip and an expression, our framework directly produces the segmentation masks as well as the detected boxes of the referred object in all frames without post-process. We validate our model on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences and it shows the state-of-the-art performance on the four benchmarks.

References

Appendix A Additional Dataset Details

Ref-Youtube-VOS is a large-scale benchmark which covers 3,978 videos with $\sim$ 15K language descriptions. There are 3,471 videos with 12,913 expressions in the training set and 507 videos with 2,096 expressions in the validation set. According to the R-VOS competition, videos in the validation set are further split into 202 and 305 videos for the competition validation and test purposes. Since the test server is currently inaccessible, the results are reported by submitting our predictions to the validation server https://competitions.codalab.org/competitions/29139.

Ref-DAVIS17 is built upon DAVIS17 by providing the language description for a specific object in each video. It contains 90 videos with 1,544 expression sentences describing 205 objects in total. The dataset is split into 60 videos and 30 videos for training and validation, respectively. Since there are two annotators and each of them gives the first-frame and full-video textual description for one referred object, we report the results by averaging the scores using the official evaluation code https://github.com/davisvideochallenge/davis2017-evaluation.

Appendix B Additional Implementation Details

Our model is optimized using the AdamW optimizer with a weight decay of $5\times 10^{-4}$ and an initial learning rate of $5\times 10^{-5}$ for the visual backbone and $10^{-4}$ for the rest. We first pretrain our model on the image referring segmentation datasets Ref-COCO , Ref-COCOg and Ref-COCO+ by setting $T=1$ with a batch size of 2 on each GPU. The pretraining procedure runs for 12 epochs, with the learning rate divided by 10 at epochs 8 and 10. Then, on Ref-Youtube-VOS, we finetune the model for 6 epochs with 1 video clip per GPU. The learning rate is divided by 10 at the 3rd and 5th epochs. On Ref-DAVIS17, we directly report the results using the model trained on Ref-Youtube-VOS without finetuning.
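The step schedule described above can be written as a small function. This sketch assumes 0-based epoch indices and that the drop takes effect once the epoch index reaches a milestone, which mirrors the usual multi-step scheduler convention; the paper does not state its indexing, and `learning_rate` is a hypothetical helper.

```python
from typing import Sequence

BACKBONE_LR = 5e-5  # initial learning rate for the visual backbone
DEFAULT_LR = 1e-4   # initial learning rate for the remaining parameters


def learning_rate(base_lr: float, epoch: int,
                  milestones: Sequence[int], gamma: float = 0.1) -> float:
    """Step decay: multiply by gamma for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Pretraining: 12 epochs, decay at epochs 8 and 10.
pretrain_schedule = [learning_rate(DEFAULT_LR, e, (8, 10)) for e in range(12)]
# Finetuning on Ref-Youtube-VOS: 6 epochs, decay at epochs 3 and 5.
finetune_schedule = [learning_rate(DEFAULT_LR, e, (3, 5)) for e in range(6)]
```

The same function applies to the backbone parameter group by passing `BACKBONE_LR` as the base value, so both groups decay at the same epochs.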

For A2D-Sentences, we feed the model with a window size of 5. The model is finetuned for 6 epochs with the learning rate decayed by a factor of 0.1 at the 3rd and 5th epochs. On JHMDB-Sentences, following the previous works, we evaluate the generality of our method using the model trained on A2D-Sentences without finetuning.

Additionally, on Ref-Youtube-VOS, we also adopt the joint training technique by mixing the dataset with Ref-COCO/+/g. Specifically, for each image in the Ref-COCO dataset, we augment it with $\pm 20^{\circ}$ to form a 5-frame pseudo video clip. The joint training takes 12 epochs with the learning rate decayed by a factor of 0.1 at the 8th and 10th epochs. We use 32 V100 GPUs for the joint training and each GPU is fed with 2 video clips. It should be noted that the text encoder is frozen all the time.

Appendix C Additional Details of Dynamic Convolution

We give the pseudo-code of dynamic convolution in Figure C1, where we take one dynamic kernel for clarification. Specifically, a linear projection is applied to transform the instance embedding into dynamic convolutional weights. Then, the mask features pass through consecutive dynamic convolutional layers with the ReLU activation function. There is no normalization or activation after the last dynamic convolutional layer, and the output channel number of last layer is 1.
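As a complement to the description above, here is a minimal pure-Python sketch of the dynamic convolution for a single kernel. It is not the authors' pseudo-code. It assumes 8-channel mask features concatenated with 2 relative-coordinate channels (10 input channels), and a flat parameter layout of weights followed by biases for each layer in turn; both are assumptions made for illustration. A $1\times 1$ convolution is a per-pixel linear map, so pixels are handled as plain vectors.

```python
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows [out][in], biases [out])

# Assumed layer shapes (in_channels, out_channels): 8 mask channels + 2 coords.
LAYER_SHAPES = [(10, 8), (8, 8), (8, 1)]


def num_params() -> int:
    """Total number of parameters produced for one dynamic kernel."""
    return sum(cin * cout + cout for cin, cout in LAYER_SHAPES)


def split_params(params: Vector) -> List[Layer]:
    """Reshape the flat parameter vector into per-layer weights and biases."""
    layers, offset = [], 0
    for cin, cout in LAYER_SHAPES:
        weights = [params[offset + o * cin: offset + (o + 1) * cin]
                   for o in range(cout)]
        offset += cin * cout
        biases = params[offset: offset + cout]
        offset += cout
        layers.append((weights, biases))
    return layers


def dynamic_conv(pixels: List[Vector], params: Vector) -> Vector:
    """Apply three 1x1 dynamic layers per pixel and return mask logits."""
    layers = split_params(params)
    logits = []
    for x in pixels:
        for index, (weights, biases) in enumerate(layers):
            x = [sum(w * v for w, v in zip(row, x)) + b
                 for row, b in zip(weights, biases)]
            if index < len(layers) - 1:  # no activation after the last layer
                x = [max(0.0, v) for v in x]
        logits.append(x[0])
    return logits
```

Under these assumed shapes each kernel needs 169 parameters (88 + 72 + 9), which is what the mask head would have to output per instance embedding. A sigmoid and a threshold on the returned logits would give the binary mask.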

Appendix D Additional Experiment Results

By default, our models are trained in a class-agnostic way, i.e., they decide whether the object is referred to or not. As described in Sec 3.2, the class head can be easily modified to predict the referred object category by simply changing the class number. In this way, we train our model in a class-discriminative way and show the results in Table D1. We observe that the class-agnostic training method has a clear performance gain ( $+$ 2.1 $\mathcal{J}\&\mathcal{F}$ ) over the strong class-discriminative training results, since the binary classification is easier to optimize. The training method can be chosen flexibly depending on the usage in real applications.