VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

Introduction

Referring segmentation aims at generating a segmentation mask for the target object that is referred to by a given query expression in natural language. Referring segmentation is one of the most fundamental yet challenging multi-modal tasks, involving both natural language processing and computer vision. It is in high demand in many practical applications, e.g., video/image editing, because it provides a user-friendly way of interaction. Recently, many deep-learning-based methods have arisen in this field and achieved remarkable results. However, great challenges remain: while the query expression implies the target object by describing its attributes and its relationships with other objects, objects in referring segmentation images relate to each other in a complex manner. Therefore, a holistic understanding of the image and language expression is desired. Another challenge is that the diverse objects/images and the unconstrained language expressions bring a high level of randomness, which requires the model to have high generalization ability in understanding different kinds of images and language expressions.

Firstly, to address the challenge of complicated correlations in the input image and query expression, we propose to enhance the holistic understanding of multi-modal information by designing a framework with global operations, in which direct interactions are built among all elements, e.g., word-word, pixel-pixel, and pixel-word. The Fully Convolutional Network (FCN)-like framework is commonly used in existing referring segmentation methods . They usually perform convolution operations on the fused, e.g., concatenated or multiplied, vision-language features to predict the segmentation mask for the target object. However, the long-range dependency modeling is intractable by regular convolution operation as its large receptive field is achieved by stacking many small-kernel convolutions. This oblique process makes the information interaction between long-distance pixels/words inefficient , thus is undesirable for the referring segmentation model to understand the global context expressed by the input image and language . In recent years, attention mechanism has gained considerable popularity in the computer vision community thanks to its advantage in building direct interaction among all elements, which greatly helps the model in capturing global semantic information. There have been some previous referring segmentation works that use attention to alleviate the long-range dependency issues, e.g., . However, most of them rely on FCN-like pipelines and only use the attention mechanism as auxiliary modules, which limits their ability to model the global context. In this work, we reformulate the referring segmentation problem as a direct attention problem and re-construct the current FCN-like framework using Transformer . We generate a set of query vectors from language features using vision-guided attention, and use these vectors to “query” the given image and predict the segmentation mask from the query responses, as shown in Figure 1. 
This attention-based framework enables us to implement global operation among multi-modal features in each computation stage and enhances the network’s ability to capture the global context of both vision and language information.

Secondly, in order to handle the randomness caused by the various objects/images and the unrestricted language expressions, we propose to understand the input language expression in different ways by incorporating vision features. In many existing referring segmentation methods, language self-attention is often used to extract the most informative part and the emphasized word(s) in the language expression. However, in these methods the language understanding is derived solely from the language expression itself, without interacting with the vision information. As a consequence, they cannot distinguish which emphasis is more suitable and effective for a particular image. Hence, their detected emphases might be inaccurate or inefficient. On the other hand, in most existing vision-transformer works, the queries of the transformer decoder are a set of fixed, learned vectors, each of which predicts an object. However, experiments show that each query vector has its own operating mode and is specifically targeted at certain kinds of objects, e.g., objects of a certain type or located in a certain area. The fixed queries in these works implicitly assume that the objects in the input image are distributed according to certain statistical rules, which does not account well for the randomness and huge diversity of referring segmentation, especially the randomness brought by unconstrained language expressions. Besides, the learnable queries are designed for detecting all the objects in the whole image instead of focusing on the target object indicated by the language expression, and thus cannot efficiently extract an informative representation that contains the clues to the target object. To address these issues, we propose to generate input-specific queries that focus on the clues related to the referred target object. 
We herein propose a Query Generation Module (QGM), which dynamically produces multiple query vectors based on the input language expression and the vision features. Each query vector represents a specific comprehension of the language expression and queries the vision features with different emphases. As shown in Figure 1, three queries focus on different information, respectively. These generated query vectors produce a set of corresponding masks in the transformer decoder though we only need one mask selected from them. Besides, we also hope to choose a more reasonable and better comprehension way from these query vectors. Therefore, we further propose a Query Balance Module (QBM), which assigns each query vector a confidence measure to control its impact on mask decoding, and then adaptively selects the output features of these queries to better generate the final mask. The proposed QGM dynamically produces input-specific queries that focus on different informative clues related to the target object, while the proposed QBM selectively fuses the corresponding responses by these queries. These two modules work together to prominently improve the diverse ways to understand the image and query language and enhance the network’s robustness towards highly random inputs.

Thirdly, we introduce masked contrastive representation learning to further enhance the model’s generalization ability and robustness to unconstrained language expressions. With the proposed Query Generation Module and Query Balance Module, we provide different understandings of a given expression, which can be viewed as a kind of intra-sample learning. Here we further consider inter-sample learning to explicitly endow the model with knowledge of the different language expressions that refer to one object. For the same target object, there are multiple ways to describe it. However, the final representations that predict the target mask should be the same. In other words, the output features of the Query Balance Module for different expressions of the same object should be the same. To this end, we utilize contrastive learning to pull together the features of different expressions for the same target object, while separating the features of different objects. Moreover, we observe that the model tends to rely too heavily on specific words that provide the most discriminative clues or occur frequently in training samples, while ignoring other complementary information. This excessive reliance on specific words damages the model’s generalization ability; for instance, the model may not understand testing expressions that lack the discriminative clues common in the training samples. To address this issue, we introduce masked language expressions in contrastive representation learning, which randomly erase some specific words from the original language expression. Since the masked language expression and the original expression refer to the same target object, they are treated as a positive pair in the contrastive representation learning, so that they are pulled close to each other and reach the same representation. The masked contrastive representation learning significantly enhances the model’s ability to deal with diverse language expressions in the wild.

The proposed approach builds deep interactions between language and vision information at different levels, which greatly enhances the utilization and fusion of multi-modal features. Besides, the proposed network is lightweight and its parameter scale is roughly equivalent to just seven convolution layers. In summary, our main contributions are listed as follows:

We design a Vision-Language Transformer (VLT) framework to facilitate deep interactions among multi-modal information and enhance the holistic understanding of vision-language features.

We propose a Query Generation Module (QGM) that dynamically produces multiple input-specific queries representing different comprehensions of the language, and a Query Balance Module (QBM) to selectively fuse the corresponding responses by these queries.

We introduce a masked contrastive representation learning to enhance the model’s generalization ability and robustness to deal with the unconstrained language expressions by learning inter-sample relationships.

The proposed approach is lightweight and achieves new state-of-the-art performance consistently on three referring image segmentation datasets, RefCOCO, RefCOCO+, G-Ref, and two referring video object segmentation datasets, YouTube-RVOS and Ref-DAVIS17.

Related Works

In this section, we discuss works that are closely related to the proposed approach, including referring segmentation, referring comprehension, and transformer.

Referring segmentation is one of the most fundamental yet challenging multi-modal tasks, involving both language and vision information. Given a natural language expression describing the properties of the target object in the given image, the goal of referring segmentation is to ground the target object referred to by the language and generate a corresponding segmentation mask. Inspired by the task of referring comprehension, referring segmentation was introduced by Hu et al., whose method concatenates the linguistic features extracted by Long Short-Term Memory (LSTM) networks and the visual features extracted by Convolutional Neural Networks (CNN). Then, the fused vision-language features are fed into a fully convolutional network (FCN) to generate the target segmentation mask. In order to better utilize the information of each word in the language expression, Liu et al. propose a multimodal LSTM (mLSTM), which models each word in every recurrent stage to fuse the word feature with vision features. Li et al. utilize features from different levels in the backbone progressively, which further improves the performance. To better utilize the language information, Edgar et al. propose a method that uses the feature of each word in the language expression when extracting language features, not just the final state of the RNN. Chen et al. employ a caption generation network to produce a caption sentence that describes the target object, and enforce the caption to be consistent with the input expression. Luo et al. propose a multi-task framework to jointly learn referring expression comprehension and segmentation. They build a network that contains a referring expression comprehension branch and a referring expression segmentation branch, each of which can reinforce the other during training. Jing et al. 
decouple referring segmentation into localization and segmentation and propose a Locate-Then-Segment (LTS) scheme to locate the target object first and then generate a fine-grained segmentation mask. Feng et al. propose to utilize the language feature earlier, in the encoder stage. Hui et al. introduce linguistic structure-guided context modeling to analyze the linguistic structure for better language understanding. Yang et al. propose a Bottom-Up Shift (BUS) to progressively locate the target object by hierarchically reasoning over the given expression.

With the introduction of attention-based methods , researchers have found that the attention mechanism is suitable for the formulation of referring segmentation. For example, Ye et al. propose the Cross-Modal Self-Attention (CMSA) model to dynamically find the most important words in the language sentence and the informative image region. Hu et al. propose a bi-directional attention module to further utilize the features of words. Most of these works are built on FCN-like networks and only use the attention as auxiliary modules. Our concurrent work MDETR employs DETR to build an end-to-end modulated detector and reason jointly over language and image. After the proposed VLT , transformer-based referring segmentation architectures receive more attention . MaIL follows the transformer architecture ViLT and utilizes instance mask predicted by Mask R-CNN as additional input. Yang et al. propose Language-Aware Vision Transformer (LAVT) to conduct multi-modal fusion at intermediate levels of the network. CRIS employs CLIP pretrained on 400M image text pairs and transfers CLIP from text-to-image matching to text-to-pixel matching.

In this work we employ a fully attention-based architecture and propose a Vision-Language Transformer (VLT) to model the long-range dependencies in the image, as shown in Figure 2. We further propose to generate input-conditional queries for the decoder of transformer to better understand the unconstrained language expressions from different aspects.

2 Referring Comprehension

Referring comprehension is a highly relevant task to referring segmentation. Referring comprehension also takes an image and a language expression as inputs and identifies the target object referred by the language expression. However, while referring segmentation aims to output a segmentation mask for the target object, the referring comprehension outputs a grounding box. Unlike the FCN-like pipeline of referring segmentation, most earlier referring comprehension works are based on the multi-stage pipeline. In these works, often an out-of-the-box instance segmentation network, e.g., Mask R-CNN , is first applied to the image and generates a set of instance proposals, regardless of the language input. Next, the candidate proposals are compared with the language expression, to find the best match. For example, Yu et al. propose a two-stage method that first extracts all instances in the image using Mask R-CNN , then employs a modular network to match and select the target object from all the instances detected by Mask R-CNN. In recent years, one-stage methods have also been increasingly adopted in the referring comprehension area, e.g., Sadhu et al. propose a “Zero-Shot Grounding” network for referring comprehension , and Yang et al. design a recursive sub-query construction framework to gradually reason between the image and query language .

3 Transformer

Transformer was first proposed by Vaswani et al. for machine translation; it is a sequence-to-sequence deep network architecture built on the attention mechanism. Recently, transformer has attracted much attention in Natural Language Processing (NLP) and achieved great success on many NLP tasks, e.g., machine translation, question answering, and language modeling. Beyond NLP, transformer has also been applied to many computer vision tasks and has achieved promising results on tasks such as object detection, image recognition, semantic segmentation, and human-object interaction.

In the vision-language field, transformer architectures have achieved great success in many tasks, e.g., vision-and-language pre-training, image generation, visual question answering, open-vocabulary detection, image retrieval, vision-and-language navigation, etc. Lu et al. design a co-attention mechanism to incorporate language-attended vision features into language features. Kim et al. propose to deal with the two modalities in a single unified transformer architecture. Huang et al. propose Pixel-BERT to align visual features with textual features by jointly learning visual and textual embeddings in a unified transformer. Based on Pixel-BERT, Zareian et al. design a vision-to-language projection to process visual features before the transformer. Radford et al. propose a Contrastive Language-Image Pre-training (CLIP) scheme to jointly train image and language encoders. Wang et al. apply CLIP to referring image segmentation by text-to-pixel alignment. Lei et al. introduce CLIPBERT for text-to-video retrieval and video question answering. Hu et al. propose a Unified Transformer (UniT) model to learn multiple vision-language tasks with a unified multi-modal architecture. Different from previous works that use a small fixed number of learned position embeddings as object queries, we propose to dynamically generate input-specific queries representing different comprehensions of language and selectively fuse the corresponding responses to these input-specific queries. With the input-specific queries, the proposed VLT better captures the informative clues hidden in the language expressions and addresses the high diversity in referring segmentation.

Methodology

The overall architecture of the proposed Vision-Language Transformer (VLT) is shown in Figure 2. The network takes a language expression and an image as inputs. First, the input image and language expression are projected into the visual and linguistic feature spaces, respectively. Then, vision and language features are fed into the proposed Query Generation Module (QGM) to generate a set of input-specific query vectors, which represent different understandings of the language expression under the guidance of visual clues. Meanwhile, vision and language features are fused into a multi-modal feature by the proposed Spatial Dynamic Fusion (SDF), and the multi-modal feature is sent to the transformer encoder to produce a group of memory features. The query vectors in Q generated by our proposed QGM are employed to “query” K and V derived from the memory features in the transformer decoder, i.e., $\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V$, where $d_{k}$ is the dimensionality of $K$. The resulting responses from the transformer decoder are then selected by a Query Balance Module (QBM) with different confidence weights. Finally, the mask decoder takes the weighted responses from the QBM and the output feature from the transformer encoder as inputs and outputs a mask for the target object. Masked Contrastive Learning (MCL) is used to supervise the features in the Mask Decoder, pulling together the features of different expressions for the same target object while separating the features of different objects. Positional embeddings are used to supplement the pixel position information in the permutation-invariant transformer architecture.
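The decoder-side querying can be illustrated with a minimal NumPy sketch of the scaled dot-product attention formula quoted above. The sizes (`N_q`, `L`, `d`) and the choice of using the same memory matrix for both `K` and `V` are assumptions made for illustration only; a real transformer decoder additionally applies learned projections, multiple heads, and residual connections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (N_q, d_k), K: (L, d_k), V: (L, d_v) -> responses (N_q, d_v)."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (N_q, L)
    return weights @ V, weights

# Hypothetical sizes: 16 queries, a 13x13 memory map, 64 channels.
rng = np.random.default_rng(0)
N_q, L, d = 16, 13 * 13, 64
Q = rng.standard_normal((N_q, d))
memory = rng.standard_normal((L, d))

# Assumption: K and V are both taken directly from the memory features.
responses, weights = scaled_dot_product_attention(Q, memory, memory)
print(responses.shape)  # one response vector per query
```

Each row of `weights` is a distribution over all memory positions, which is what lets every query interact with every pixel in a single step.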

After backbone features for language and image are extracted, the first step is to preliminarily fuse them together and generate a multi-modal feature. For referring segmentation, effective multi-modal information fusion is critical and challenging because the unconstrained expression of natural language and the diversity of objects in scene images bring huge uncertainty to the understanding and fusion of multi-modal features. However, existing approaches conduct multi-modal feature fusion simply by either concatenation or point-wise multiplication of vision feature and language feature. The language feature, which is a 1D vector, is usually tiled to every position of the vision feature , as shown in Figure 3. Under the “tile-and-concatenate” operation, the language feature is identically copied to every position across the $H\times W$ map.
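The “tile-and-concatenate” baseline described above can be sketched in a few lines of NumPy. The function name and the feature sizes are hypothetical; the point is only that every spatial position receives an identical copy of the sentence vector.

```python
import numpy as np

def tile_and_concatenate(vision, language):
    """vision: (H, W, C_v), language: (C_t,) -> fused (H, W, C_v + C_t)."""
    H, W, _ = vision.shape
    # Copy the same 1D language vector to every spatial position.
    tiled = np.broadcast_to(language, (H, W, language.shape[0]))
    return np.concatenate([vision, tiled], axis=-1)

# Hypothetical sizes for illustration.
rng = np.random.default_rng(0)
H, W, C_v, C_t = 4, 5, 8, 6
vision = rng.standard_normal((H, W, C_v))
language = rng.standard_normal(C_t)

fused = tile_and_concatenate(vision, language)
# The language part is identical at any two positions.
same = np.array_equal(fused[0, 0, C_v:], fused[H - 1, W - 1, C_v:])
```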

Although such fusion techniques are simple and have achieved reasonable performance, they have a few drawbacks. Firstly, the features of individual words are not fully utilized in this step. Secondly, the tiled language feature is identical for all pixels across the image feature, which weakens the location information carried by the correlation between the language information and the visual information. Due to the diversity of objects in the input image, an image usually contains complex information, and different regions may contain different semantic information. Meanwhile, the language expression can be interpreted with different emphases from different perspectives. We here emphasize the differences among pixels/objects, i.e., the vision information across the image varies from place to place. Therefore, the informative words in a given sentence differ from pixel to pixel. Tiling ignores such differences and simply assigns the same language feature vector to every pixel, which causes confusion. It is better to tailor the feature fusion to each individual pixel. In this work, we propose a Spatial-Dynamic Fusion (SDF) module, which produces different language feature vectors for different positions of the image feature according to the interaction between language information and the corresponding pixel information. Each position selects the words most relevant to it and pays more attention to these words during multi-modal fusion.

An illustration of the proposed Spatial-Dynamic Fusion (SDF) module is shown in Figure 4. The proposed SDF module takes language features $F_{t}$ , including features of each word and the whole sentence, and image feature $F_{vr}$ as inputs. We first use language feature and vision feature to generate the Spatial-Dynamic Attention matrix by:

where $\copyright$ denotes concatenation. Following previous transformer works , we employ fixed sine spatial positional embeddings to supplement the pixel position information in the permutation-invariant transformer. The fused multi-modal feature and the positional embeddings are inputted to the transformer encoder (see Figure 2).
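As a rough illustration of the per-pixel fusion idea, the sketch below builds an attention matrix between pixels and words, uses it to form a different language vector at each position, and then concatenates that vector with the vision feature. It is a minimal sketch under stated assumptions, not the exact SDF formulation: the scaled dot-product form of the attention scores, the projection matrices `W_v`, `W_t`, `W_out`, and all sizes are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_dynamic_fusion(F_vr, F_t, W_v, W_t, W_out):
    """F_vr: (HW, C) vision features, F_t: (N_t, C) word/sentence features."""
    C = F_vr.shape[1]
    # Assumption: pixel-word scores are scaled dot products of projections.
    scores = (F_vr @ W_v) @ (F_t @ W_t).T / np.sqrt(C)  # (HW, N_t)
    A = softmax(scores, axis=-1)                        # each pixel weighs the words
    F_lang = A @ F_t                                    # (HW, C), differs per pixel
    # Concatenate per-pixel language and vision features, then project back to C.
    fused = np.concatenate([F_lang, F_vr], axis=-1) @ W_out
    return fused, A

rng = np.random.default_rng(0)
HW, N_t, C = 13 * 13, 16, 32  # hypothetical sizes
F_vr = rng.standard_normal((HW, C))
F_t = rng.standard_normal((N_t, C))
W_v = rng.standard_normal((C, C)) / np.sqrt(C)
W_t = rng.standard_normal((C, C)) / np.sqrt(C)
W_out = rng.standard_normal((2 * C, C)) / np.sqrt(2 * C)

fused, A = spatial_dynamic_fusion(F_vr, F_t, W_v, W_t, W_out)
```

In contrast to tiling, row `A[i]` generally differs from row `A[j]`, so two pixels can emphasize different words of the same expression.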

2 Query Generation Module

In most existing Vision Transformer works, queries for the transformer decoder are usually a set of fixed learned vectors, each of which is used to predict one object and has its own operating mode, e.g., specializing in objects of a certain kind or located in a certain region. These works with fixed queries implicitly assume that objects in the input image are distributed according to certain statistical rules. However, such an assumption does not consider the huge diversity of referring segmentation. Besides, the learnable queries are designed for detecting all objects in the whole image instead of focusing on the target object indicated by the language expression, and thus cannot effectively extract an informative representation that contains clues to the target object.

For referring segmentation, the target object described by the given expression can be any part of the image. Because both the input image and language expression are unconstrained, the stochasticity of the target object’s properties is very high. Therefore, fixed query vectors, as in most existing ViT works, cannot represent the properties of the target object well. Instead, the properties of the target object are hidden in the input language expression, e.g., keywords like “blue/yellow”, “small/large”, “right/left”, etc. To capture the informative clues and address the high stochasticity in referring segmentation, we propose a Query Generation Module (QGM) to adaptively generate the input-specific query vectors online according to the given image and language expression. Also, it is well known that different words in a language expression differ in importance. Some existing works address this issue by measuring the importance of each word. For example, one approach gives each word a weight and defines a set of groups, e.g., location, attribute, entity, and finds the degree to which each word belongs to the different groups. Most works derive the weights by language self-attention, which does not utilize the information in the image and only outputs one set of weights. But in practice, the same sentence may be understood from different perspectives and with different emphases, and the most suitable and effective emphasis can only be known with the help of the image. We give an intuitive example in Figure 5. For the same input sentence “The large balloon on the left”, the word “left” is more informative for the first image while the word “large” is more useful for the second image. In this case, language self-attention cannot differentiate the importance between “large” and “left”, making the attention process less effective. 
In order to let the network learn different aspects of information and enhance the robustness of the queries, we generate multiple queries with the help of visual information, though there is only one target instance. Each query represents a specific comprehension of the given language expression with different emphasized word(s).

Next, we comprehend the language expression from multiple aspects incorporating the image, forming $N_{q}$ queries from language. We derive the attention weights for language features $F_{t}$ by incorporating the vision features $F_{vq}$ ,

Next, the derived attention weights are applied to the language features:
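As a rough illustration of the two steps above (deriving vision-guided attention weights over the words, then applying them to the language features), the following sketch produces $N_{q}$ query vectors as different weighted sums of the word features. It is a sketch under stated assumptions rather than the exact formulation: the shape of `F_vq` (one vision-derived descriptor per query), the projections `W_v` and `W_t`, the softmax normalization, and the weighted-sum aggregation are all hypothetical choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def generate_queries(F_vq, F_t, W_v, W_t):
    """F_vq: (N_q, D_v) vision descriptors, F_t: (N_t, C) word features.

    Returns F_q: (N_q, C) query vectors and A: (N_q, N_t) word weights.
    """
    C = F_t.shape[1]
    # Assumption: each vision descriptor scores every word by a scaled dot product.
    scores = (F_vq @ W_v) @ (F_t @ W_t).T / np.sqrt(C)
    A = softmax(scores, axis=-1)   # one set of word weights per query
    F_q = A @ F_t                  # apply the weights to the language features
    return F_q, A

rng = np.random.default_rng(0)
N_q, N_t, D_v, C = 16, 15, 48, 32  # hypothetical sizes
F_vq = rng.standard_normal((N_q, D_v))
F_t = rng.standard_normal((N_t, C))
W_v = rng.standard_normal((D_v, C)) / np.sqrt(D_v)
W_t = rng.standard_normal((C, C)) / np.sqrt(C)

F_q, A = generate_queries(F_vq, F_t, W_v, W_t)
```

Because the weights depend on the vision descriptors, the same sentence yields different queries for different images, unlike language self-attention.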

3 Query Balance Module

We get $N_{q}$ different query vectors from the proposed Query Generation Module. Each query represents a specific comprehension of the input language expression under the interactive guidance of the input image information. As discussed before, both the input image and language expression are highly arbitrary. Thus, it is desirable to adaptively select the better comprehensions and let the network focus on the more reasonable and suitable ones. On the other hand, as the independence of each query vector is kept in the transformer decoder but we only need one mask output, the influence of different queries on the final output needs to be balanced. Therefore, we propose a Query Balance Module (QBM) to dynamically assign each query vector a confidence measure that reflects how well it fits the prediction and the context of the image.

The architecture of the proposed QBM is shown in Figure 8. Specifically, the inputs of the Query Balance Module are the query vectors $F_{q}$ from the Query Generation Module and their corresponding responses from the transformer decoder, $F_{r}$, which have the same size as $F_{q}$. In the Query Balance Module, the query vectors first go through a linear layer and are then concatenated with their corresponding responses. Then, a set of query confidence levels $C_{q}$, of shape $N_{q}\times 1$, is generated by two consecutive linear layers, which derive the confidence levels from the query vectors $F_{q}$ and their corresponding responses $F_{r}$. Sigmoid, $S(x)=\frac{1}{1+e^{-x}}$, is employed after the last linear layer as an activation function to control the output range. Let $F_{rn}$ and $C_{qn}$ denote the response and the query confidence corresponding to the $n$-th query $F_{qn}$, respectively. Each scalar $C_{qn}$ shows how well the query $F_{qn}$ fits the context of its prediction, and controls the influence of its response $F_{rn}$ on the mask decoding. Each response $F_{rn}$ is multiplied by the corresponding query confidence $C_{qn}$, i.e., $F_{bn}=F_{rn}C_{qn}$. The balanced responses $F_{b}=\{F_{b1},...,F_{bn},...,F_{bN_{q}}\}$ are sent for mask decoding. The proposed QGM dynamically produces input-specific queries that focus on different informative clues related to the target object, while the proposed QBM selectively fuses the corresponding responses to these queries. These two modules work together to substantially diversify the ways in which the image and query language are understood, and enhance the model’s robustness towards highly stochastic inputs.
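The data flow of the QBM can be sketched as follows. The weights are random placeholders, and the ReLU between the two consecutive linear layers is an assumption (the text does not name an intermediate activation); the hidden width and all sizes are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def query_balance(F_q, F_r, W_q, W1, W2):
    """F_q, F_r: (N_q, C). Returns balanced responses F_b and confidences C_q."""
    # Linear layer on the queries, then concatenation with the responses.
    joint = np.concatenate([F_q @ W_q, F_r], axis=-1)  # (N_q, 2C)
    # Two consecutive linear layers; the ReLU in between is an assumption.
    hidden = np.maximum(joint @ W1, 0.0)               # (N_q, C)
    C_q = sigmoid(hidden @ W2)                         # (N_q, 1), values in (0, 1)
    # F_bn = F_rn * C_qn, broadcast over the channel dimension.
    F_b = F_r * C_q
    return F_b, C_q

rng = np.random.default_rng(0)
N_q, C = 16, 32  # hypothetical sizes
F_q = rng.standard_normal((N_q, C))
F_r = rng.standard_normal((N_q, C))
W_q = rng.standard_normal((C, C)) / np.sqrt(C)
W1 = rng.standard_normal((2 * C, C)) / np.sqrt(2 * C)
W2 = rng.standard_normal((C, 1)) / np.sqrt(C)

F_b, C_q = query_balance(F_q, F_r, W_q, W1, W2)
```

A confidence close to zero suppresses a query's response almost entirely, while a confidence close to one passes it through unchanged.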

4 Mask Decoder

The output of the Query Balance Module $F_{b}$, with size $N_{q}\times C$, is sent to the mask decoder, as shown in Figure 9. In the mask decoder module, $F_{b}$ is utilized as a set of mask generation kernels to process the vision-dominated feature $F_{ve}$ from the transformer encoder and produce the mask feature $F_{m}$, i.e.,

$$F_{m}=F_{ve}F_{b}^{T},\tag{7}$$

where $F_{ve}$ has size $HW\times C$, so that $F_{m}$ has size $HW\times N_{q}$. Then we reshape $F_{m}$ to $H\times W\times N_{q}$ for the final mask generation. We use three stacked $3\times 3$ convolution layers for decoding, followed by one $1\times 1$ convolution layer that outputs the final predicted segmentation mask. To control the output size and generate a higher-resolution mask, an upsampling layer is placed after each of the three $3\times 3$ convolution layers. To better demonstrate the effectiveness of the proposed transformer module, the Mask Decoding Module in our implementation does not utilize any CNN features. We employ the Binary Cross-Entropy loss on the predicted masks to supervise the network training.
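The stated sizes ($N_{q}\times C$ for $F_{b}$, $HW\times C$ for $F_{ve}$, and $HW\times N_{q}$ for $F_{m}$) are consistent with applying each balanced response as a $1\times 1$ kernel over the encoder feature, which amounts to a single matrix product. The sketch below assumes exactly that and uses hypothetical sizes; the subsequent convolution and upsampling layers are omitted.

```python
import numpy as np

def mask_features(F_ve, F_b, H, W):
    """F_ve: (H*W, C) encoder feature, F_b: (N_q, C) balanced responses."""
    # Each row of F_b acts as a kernel evaluated at every spatial position.
    F_m = F_ve @ F_b.T            # (H*W, N_q)
    return F_m.reshape(H, W, -1)  # (H, W, N_q) for the convolutional decoder

rng = np.random.default_rng(0)
H, W, C, N_q = 13, 13, 32, 16  # hypothetical sizes
F_ve = rng.standard_normal((H * W, C))
F_b = rng.standard_normal((N_q, C))

F_m = mask_features(F_ve, F_b, H, W)
```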

5 Masked Contrastive Learning

Here we further consider inter-sample learning to explicitly endow the model with the knowledge of different language expressions to one object. The given expression in natural language is unconstrained. There are multiple ways to describe the same target object, which brings challenges in understanding these expressions. Given an image $I$ that contains $N_{O}$ objects $\{O_{1},...,O_{i},...,O_{N_{O}}\}$ , every object $O_{i}$ in $I$ can be referred by ${N_{E}}$ different expressions $\{E_{i}^{1},...,E_{i}^{j},...,E_{i}^{N_{E}}\}$ . A sample $S(I,O_{i},E_{i}^{j})$ of referring segmentation defines a mapping from an expression to the target object: $\{E_{i}^{j}\rightarrow O_{i}|I\}$ . The mappings between $E_{i}$ and $O_{i}$ are in general many-to-one. An object in the image can be described by many different language expressions, but one language expression should unambiguously point to one and only one instance. Thus, no matter what kind of expressions are given to the target object, the final mask is the same, i.e., the feature $F_{m}$ in Eq. (7) for generating the final mask is the same. Motivated by this, here we introduce contrastive learning that forces the network to narrow the distance of features of different expressions for the same target object while enlarging the distance of features for different objects. Furthermore, to provide more positive pairs and enhance the model’s generalization ability to the input language, we randomly mask some specific words in the language expression and add these masked expressions to the positive samples of the original expression.
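The random word masking can be illustrated with a short standard-library sketch. The masking ratio, the `<mask>` placeholder token, and the function name are assumptions for illustration; the text above only states that some words are randomly masked and that the masked expression serves as a positive sample of the original.

```python
import random

def mask_expression(words, mask_ratio=0.2, mask_token="<mask>", rng=None):
    """Return a copy of `words` with a random subset replaced by `mask_token`."""
    rng = rng or random.Random()
    if len(words) <= 1:
        return list(words)  # nothing sensible to mask
    # Mask at least one word; the ratio is a hypothetical hyperparameter.
    n_mask = max(1, int(len(words) * mask_ratio))
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    return [mask_token if i in masked_idx else w for i, w in enumerate(words)]

original = "the large balloon on the left".split()
masked = mask_expression(original, rng=random.Random(0))
# (original, masked) would then be treated as a positive pair.
```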

To sample the training pairs for contrastive learning, we summarize the inter-sample relationships into three categories: 1) Different Image (DI), 2) Same Image Different Object (SIDO), 3) Same Image Same Object (SISO), as shown in Figure 10. Unlike existing methods that construct the training batches in a fully random manner, we intend to let one batch have all kinds of inter-sample relationships. First, we randomly choose an initial sample $S_{init}$, as shown in Figure 11. We denote its SISO samples as $S_{SO}$, whose image $I$ and object $O_{i}$ are the same as those of $S_{init}$ but whose expressions $E_{i}^{j}$ are different, and denote its SIDO samples as $S_{DO}$, which have the same image $I$ as the initial sample but a different target object. When constructing a mini-batch, we first put the initial sample into it. Next, we intentionally add at most $N_{SO}$ samples from $S_{SO}$ and at most $N_{DO}$ samples from $S_{DO}$. The rest of the batch is filled with randomly chosen DI samples. Under this mechanism, every training batch contains all kinds of inter-sample relationships, as shown in Figure 11.
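The batch construction procedure above can be sketched with the standard library. The sample representation (a dictionary with `image`, `object`, and `expr` keys), the toy dataset, and the function name are hypothetical; the sketch also assumes that every (image, object, expression) triple appears only once in the dataset.

```python
import random

def build_batch(dataset, batch_size, n_so, n_do, rng):
    """Build one mini-batch containing SISO, SIDO and DI samples."""
    init = rng.choice(dataset)
    same_image = [s for s in dataset
                  if s["image"] == init["image"] and s is not init]
    # SISO: same image and object, different expression.
    siso = [s for s in same_image if s["object"] == init["object"]]
    # SIDO: same image, different target object.
    sido = [s for s in same_image if s["object"] != init["object"]]
    # DI: samples from other images.
    di = [s for s in dataset if s["image"] != init["image"]]

    batch = [init]
    batch += rng.sample(siso, min(n_so, len(siso)))  # at most N_SO
    batch += rng.sample(sido, min(n_do, len(sido)))  # at most N_DO
    n_fill = batch_size - len(batch)
    batch += rng.sample(di, min(n_fill, len(di)))    # fill the rest with DI
    return batch

# Toy dataset: 3 images, 2 objects per image, 3 expressions per object.
dataset = [{"image": i, "object": o, "expr": e}
           for i in range(3) for o in range(2) for e in range(3)]
batch = build_batch(dataset, batch_size=8, n_so=2, n_do=2, rng=random.Random(0))
```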

As mentioned earlier, the features of SISO samples for generating the final mask should be the same. In contrast, although SIDO samples share the same input image, so the output features of the transformer may tend to be similar, their features for generating the mask prediction should differ because their target outputs are different. From this point of view, we introduce contrastive learning as feature-level supervision. In our approach, the Mask Decoder module is responsible for generating the output mask, hence we apply the contrastive learning to the feature $F_{m}$, see Eq. (7), of the Mask Decoder module. Inspired by the InfoNCE loss, our loss is defined as follows:

where $S_{+}$ denotes a SISO sample of the initial sample, $\tau$ is a temperature constant, $f_{S}$ is the feature $F_{m}$ in the Mask Decoder module, and $\left<,\right>$ denotes the cosine similarity function. This loss function forces the mask feature of the initial sample to be closer to those of its SISO samples, which are supposed to have the identical output feature and mask, and pushes it away from those of its SIDO samples, which are supposed to have masks that do not overlap with it.
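To make the role of each symbol concrete, an InfoNCE-style loss of this kind can be sketched as below. The exact form is an assumption: for each positive $S_{+}$ we take $-\log\frac{\exp(\left<f_{S_{init}},f_{S_{+}}\right>/\tau)}{\exp(\left<f_{S_{init}},f_{S_{+}}\right>/\tau)+\sum_{S_{-}}\exp(\left<f_{S_{init}},f_{S_{-}}\right>/\tau)}$ with $S_{-}$ ranging over the SIDO samples, and average over the positives. The function names and the default $\tau=0.1$ are hypothetical, and features are treated as flat, non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity <u, v> between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(f_init, positives, negatives, tau=0.1):
    """InfoNCE-style loss (assumed form, see the lead-in).

    f_init    : feature F_m of the initial sample
    positives : features of SISO samples (should be pulled closer)
    negatives : features of SIDO samples (should be pushed away)
    tau       : temperature constant (hypothetical default)
    """
    neg_sum = sum(math.exp(cosine(f_init, f) / tau) for f in negatives)
    losses = []
    for f_pos in positives:
        pos = math.exp(cosine(f_init, f_pos) / tau)
        losses.append(-math.log(pos / (pos + neg_sum)))
    # Average over all positive pairs of the initial sample.
    return sum(losses) / len(losses)
```

The loss approaches zero when the initial feature aligns with its SISO features and is dissimilar to its SIDO features, and grows when the two roles are swapped.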

6 Network Architecture

Experiments

We conducted extensive experiments to demonstrate the effectiveness of our proposed Vision-Language Transformer (VLT) for referring segmentation. In this section, we introduce implementation details of our approach, benchmarks we used in the experiments, and report both the quantitative and qualitative results of our proposed approach compared with other state-of-the-art methods.

Experiment Settings. Following previous works, we use the same experiment settings. Our framework utilizes Darknet-53 pretrained on partial MSCOCO as the visual CNN backbone. Images from the validation and test sets of the RefCOCO series are excluded from the pretraining. We use bi-GRU as the RNN implementation and the GloVe Common Crawl 840B vectors for word embedding. The training image size is set to $416\times 416$ pixels. Each Transformer block has eight heads, and the hidden layer size in all heads is set to 256. For RefCOCO and RefCOCO+, we set the maximum word number to 15, and for G-Ref, we set it to 20 as there are more long sentences. The Adam optimizer is used to train the network for 50 epochs, and the learning rate is set to $\lambda$ = $0.001$. The batch size is 32 on one 32G V100 GPU.

Metrics. We use two metrics in our experiments: mask Intersection-over-Union (IoU) and Precision with thresholds (Pr@$X$). The mask IoU measures the mask quality, which emphasizes the model’s overall performance and reveals both its targeting and segmenting abilities. The Pr@$X$ metric computes the ratio of successfully predicted samples under different IoU thresholds. Low-threshold precision like Pr@0.5 reflects the identification performance of the method, and high-threshold precision like Pr@0.9 reveals the ability to generate high-quality masks.
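As an illustration of the two metrics, the sketch below computes a per-sample mask IoU and Pr@$X$ on binary masks stored as nested lists. It rests on assumptions not fixed by the text: a sample counts as successful when its IoU is at least the threshold, and an empty prediction against an empty ground truth scores 1. How per-sample IoUs are aggregated over a dataset is also left open here. The function names are hypothetical.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def precision_at(ious, threshold):
    """Pr@X: fraction of samples whose IoU reaches the threshold X."""
    return sum(1 for iou in ious if iou >= threshold) / len(ious)
```

For example, `precision_at(ious, 0.5)` reflects how often the correct object is found, while `precision_at(ious, 0.9)` only counts near-perfect masks.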

2 Datasets

The proposed VLT is evaluated on three public referring segmentation datasets: RefCOCO, RefCOCO+ and G-Ref .

RefCOCO & RefCOCO+ are two of the largest image datasets for referring segmentation. They are also called UNC & UNC+ datasets in some literature. 142,209 referring language expressions describing 50,000 objects in 19,992 images are collected in the RefCOCO dataset, and 141,564 referring language expressions for 49,856 objects in 19,992 images are collected in the RefCOCO+ dataset. The difference between the two datasets is that RefCOCO+ restricts how the language sentences may be expressed. For example, descriptions of absolute locations, e.g., “leftmost”, are forbidden in the RefCOCO+ dataset.

G-Ref, also called RefCOCOg, is another well-known and widely recognized referring segmentation dataset. 104,560 referring language expressions for 54,822 objects in 26,711 images are used in G-Ref. Unlike RefCOCO & RefCOCO+, the language usage in G-Ref is more casual but complex, and the sentences of G-Ref are also longer on average. Notably, G-Ref has two versions: one is called the UMD split, the other is called the Google split. The UMD split has both its validation and testing sets publicly available, but the Google split only makes its validation set public. We report the results of the proposed VLT on both the UMD and Google versions.

3 Ablation Study

In this section, we conduct ablation studies on the test B of RefCOCO to demonstrate the effectiveness of the proposed modules in our Vision-Language Transformer framework.

Transformer vs. ConvNet. To demonstrate the scale of our proposed network and verify the effectiveness of the transformer module, we compare our method with a regular ConvNet in terms of performance and parameter size in TABLE I. In the experiment, we replace all transformer-based modules, including the transformer encoder-decoder, the Query Generation Module, and the Query Balance Module, with seven stacked $3\times 3$ Conv layers that have a parameter size similar to that of our transformer-based modules. The results show that our transformer-based modules achieve much better performance while their parameter size is roughly equal to that of only seven convolutional layers. The transformer module outperforms the 7-Conv module by a margin of $\sim$5% in IoU and $\sim$7% in Precision@0.5, which demonstrates the effectiveness of the proposed transformer module.

Query Generation. In TABLE II, we compare different kinds of query generation methods, including our proposed Query Generation Module (QGM), language features $F_{t}$ as queries, and learned parameters as queries. The Query Generation Module outperforms the other two methods by a large margin of about 5%-7% in IoU and 4%-6% in Pr@0.5. First, we directly use the language features $F_{t}$ as query vectors and send them into the transformer decoder. In detail, the given language expression is processed by an RNN, and the outputs for every word and for the whole sentence are used as query vectors. TABLE II shows that the performance of $F_{t}$ as queries is $\sim$5% worse than that of QGM, because the information between words is not sufficiently exchanged and the understanding of the language is derived from the language alone, as discussed in Sec 3.2. The much better performance of QGM over “$F_{t}$” as queries demonstrates that the proposed QGM effectively understands the language expressions and produces valid attended language features under the guidance of visual information. Second, we use learned parameters as queries: we set 16 query vectors that are initialized with a uniform distribution at the beginning of training and train these query parameters together with the network. As shown by “learnt” in TABLE II, the performance of these learned fixed query vectors is unsatisfactory, only 58.50%, which shows that such learned query parameters cannot represent the target object as effectively as the input-specific queries generated online by the proposed QGM.

Query Number $N_{q}$. To demonstrate the influence of the query number $N_{q}$ on the results, we evaluate the network with different numbers of query vectors. As shown in TABLE III and Figure 12, although only one segmentation mask is required in the final prediction, multiple queries provide diverse clues and achieve better results than a single query. As the query number $N_{q}$ increases, the performance gradually improves, and a significant gain of about 8% is achieved from 1 query to 16 queries. The performance gain slows down once $N_{q}$ exceeds 8, so we select $N_{q}$ = 16 as the default setting. The performance gain achieved by a larger $N_{q}$ verifies that the multiple input-specific queries produced by the proposed QGM dynamically represent diverse comprehensions of the language expression. When the Query Balance Module (QBM) is discarded, marked with $\ddagger$ in TABLE III, a performance drop of 1.44% IoU is observed, which demonstrates the advantage of the proposed QBM.

Tile vs. Spatial-Dynamic Fusion. In TABLE IV, we compare the “tile-and-concatenate” fusion and our proposed Spatial-Dynamic Fusion (SDF). As discussed in Section 3.1, the “tile” operation does not consider the differences between pixels but uses an identical sentence feature for all pixels across the image. In contrast, the proposed spatial-dynamic fusion customizes a unique language feature for every pixel according to the interaction between the language information and the corresponding pixel information. As shown in TABLE IV, compared with “Tile”, the SDF module brings a performance gain of 0.84% IoU and 1.23% Pr@0.5. The proposed SDF emphasizes the differences among pixels/objects and allows each position to select the more informative words, enhancing the multi-modal fusion and producing better multi-modal features. “Tile + Conv $\times 4$” in TABLE IV, which has the same number of parameters as the proposed SDF, does not bring better performance than “Tile” because our network already has sequential convolution layers after the feature fusion.

Inter-Sample Learning. Here we demonstrate the effectiveness of our proposed inter-sample learning approach, Masked Contrastive Learning (MCL). The results are shown in TABLE V. First, we add Contrastive Learning (CL) to the training of our network; CL does not use masked sentences as SISO samples. As shown in TABLE V(a), on the original testing set, CL brings a performance gain of 1.27% in IoU and 1.04% in Pr@0.5, which demonstrates that inter-sample learning does enhance the model’s performance. Further, we introduce the samples with masked sentences as SISO samples, i.e., positive pairs in contrastive learning. Compared with w/o CL, MCL brings a large performance gain of 1.81% IoU on the original dataset. Compared with CL, MCL further brings a gain of 0.54% in IoU and 0.51% in Pr@0.5, which shows the benefit of introducing samples with masked expressions in training. To better demonstrate the model’s ability to deal with unconstrained and diverse language expressions in the wild, we conduct two further tests: 1) erasing some informative words from the given language expressions of the testing samples, see “Masked” in TABLE V(a); 2) cross-dataset validation between two datasets that have different common clues, i.e., training on RefCOCO while testing on the validation set of RefCOCO+, marked with w/o CL†, etc. in TABLE V(b). First, as shown in TABLE V(a), compared with the original dataset, w/o CL drops 3.9% in IoU and 4.82% in Pr@0.5 on the Masked testing samples. This result shows that the w/o CL model relies heavily on common keywords and is strongly affected when these common clues are missing. For w/ MCL, in contrast, the performance drop on the Masked validation is 1.04%, much less than that of w/o CL, which verifies the robustness and generalization ability brought by masked contrastive learning.
Next, we conduct the cross-dataset validation on RefCOCO and RefCOCO+ in TABLE V(b). In RefCOCO, a large number of samples use absolute locations (e.g., “the left”, “on the right”, etc.) to describe the target object, but such expressions are not allowed in RefCOCO+. Therefore, the cross-dataset validation provides a good simulation of a practical scenario in which the training and testing information are inconsistent and only partial clues are available for testing. As shown in TABLE V(b), w/ MCL† outperforms w/o CL† by 3.19% in IoU and 4.35% in Pr@0.5, which verifies the model’s robustness and generalization ability in dealing with diverse language expressions that differ from the training samples. “Native” in TABLE V(b) denotes training & testing on RefCOCO+. Comparing w/ MCL† with “Native”, we can see that the model trained on RefCOCO with MCL achieves competitive results on the validation set of RefCOCO+ compared to the model trained on RefCOCO+, showing that the proposed masked contrastive learning enhances the model’s generalizability under open-world practical scenarios.

Next, we conduct an ablation study on the word selection mechanism in our masked contrastive learning. Apart from the baseline model that disables MCL, we test three mask-word selection methods: 1) randomly choose a word to mask, 2) randomly choose a word with the weight $a_{i}$ greater than a threshold $\theta$ to mask, 3) the proposed method, in which words are masked based on the probability $p_{m}$. TABLE VI shows that our method outperforms the other mask-word selection mechanisms.
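The three selection strategies compared above can be sketched as follows. This is only an illustration: the per-word weights $a_{i}$ and probabilities $p_{m}$ are taken as given inputs rather than computed, strategy 3 is assumed to mask each word independently with its own probability, and the `[MASK]` placeholder token and all function names are hypothetical.

```python
import random


def pick_random(words, rng=random):
    """Strategy 1: index of one uniformly chosen word."""
    return rng.randrange(len(words))


def pick_above_threshold(weights, theta, rng=random):
    """Strategy 2: index of one random word whose weight a_i exceeds theta."""
    candidates = [i for i, a in enumerate(weights) if a > theta]
    return rng.choice(candidates) if candidates else None


def pick_by_probability(p_m, rng=random):
    """Strategy 3 (assumed form): mask word i independently with prob. p_m[i]."""
    return [i for i, p in enumerate(p_m) if rng.random() < p]


def apply_mask(words, indices, token="[MASK]"):
    """Replace the selected words with a placeholder token (hypothetical)."""
    chosen = set(indices)
    return [token if i in chosen else w for i, w in enumerate(words)]
```

A masked expression produced this way would then be added to the SISO (positive) samples of the original expression.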

For the setting of $N_{DO}$ in MCL, the ablation study in Figure 13 shows that the performance of the network reaches the peak when $N_{DO}$ is set to $10\%$ of the batch size. For $N_{SO}$ , as the average number of expressions for an object is around 3, we can include all available Same Object (SO) samples in most cases.

In Figure 14, we provide qualitative examples to show the effectiveness of the Masked Contrastive Learning. The original input language expression contains information in two aspects: color (“black”) and attribute (“coat”). The model without MCL overly relies on the more obvious color information (“black”), so it fails to predict when the word is erased. In contrast, the model with MCL successfully finds the target with partial information, showing that the MCL enhances the model’s generalization ability to various language expressions.

We further test the training efficiency of our masked contrastive learning approach. We train the network with and without MCL and report the GPU memory usage and the training speed of the two runs with the batch size set to 16. With MCL enabled, the GPU memory usage and average training speed are 18496MB and 0.479s/iter, respectively. Without MCL, they are 17842MB and 0.471s/iter, respectively. The increases in training memory and time caused by MCL are less than $4\%$ and $2\%$, respectively.
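The relative overheads follow directly from the reported figures, as the short check below shows; only the variable names are introduced here.

```python
# Reported measurements with and without MCL (batch size 16).
mem_with, mem_without = 18496, 17842    # GPU memory in MB
time_with, time_without = 0.479, 0.471  # seconds per iteration

mem_overhead = (mem_with - mem_without) / mem_without
time_overhead = (time_with - time_without) / time_without

print(f"memory: +{mem_overhead:.2%}, time: +{time_overhead:.2%}")
# memory: +3.67%, time: +1.70%
```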

4 Comparison with State-of-the-art Methods

Here we compare the proposed Vision-Language Transformer (VLT) framework with previous state-of-the-art referring image segmentation methods on three commonly used benchmarks, RefCOCO, RefCOCO+, and G-Ref. The results are reported in TABLE VII. It can be seen that the proposed VLT outperforms previous state-of-the-art methods on all three benchmarks. On RefCOCO, the IoU performance of the proposed VLT is better than that of other methods, e.g., LTS, with a gain of $\sim$2% on three different testing splits. On RefCOCO+, the proposed VLT achieves a new state-of-the-art result and is around 2% better than the previous state-of-the-art method. On the hard benchmark G-Ref, which has longer language expressions, the proposed VLT consistently achieves new state-of-the-art referring segmentation performance with an IoU improvement of about 0.5%-3%, which demonstrates that the proposed VLT handles hard cases and long expressions well. We assume the reason is that, on the one hand, long and complex expressions usually contain more clues and more emphases, and our proposed Query Generation Module and Query Balance Module can produce multiple comprehensions with different emphases and find the more suitable ones. On the other hand, harder cases also contain complex scenarios that need a holistic view and understanding of the given language expression and image, and multi-head attention, as a global operator, is more appropriate for such complex scenarios. Meanwhile, compared with other methods that use stronger backbones, e.g., DeepLab-R101, MaskRCNN-R101, ResNet101, our backbone Darknet53 and our proposed modules are lightweight.

To compare with methods using stronger backbones, we further provide results with stronger visual and textual encoders in TABLE VII. We use the popular vision transformer backbone Swin-B as the visual encoder and BERT as the textual encoder to replace Darknet53 and bi-GRU, respectively. Methods pretrained on large-scale vision-language datasets are marked with $\dagger$, e.g., MaIL adopts ViLT pretrained on four large-scale vision-language pretraining datasets and CRIS employs CLIP pretrained on 400M image-text pairs. As shown in TABLE VII, the proposed approach outperforms MaIL and CRIS by around $2\%\!\sim\!4\%$ IoU without using large-scale vision-language datasets in pretraining, which demonstrates the effectiveness of our proposed modules with stronger visual and textual encoders. In particular, VLT achieves a higher performance gain on the more difficult G-Ref dataset, which has a longer average sentence length and more complex and diverse word usage, e.g., VLT is $\sim\!\!4\%$ IoU better than MaIL and LAVT on test ${}_{\text{(U)}}$ of G-Ref. This demonstrates the proposed model’s good ability to deal with long and complex expressions with large diversity, which is mainly attributed to the input-conditional query generation and selection that cope well with diverse words/expressions, and to the masked contrastive learning that enhances the model’s generalization ability.

5 Qualitative Results and Visualization

In Figure 15(a), we extract and visualize an attention map for a position “$P$” from the 2nd layer of our transformer encoder. It shows that in a single layer of the transformer, the attention of one output pixel extends globally to other input pixels far away. We also see that a pixel on one instance attends to other instances, showing that our network is able to capture long-range interactions between instances. In Figure 15(b), we visualize four query vectors $F_{q}$ (see Figure 6 and Eq. (6)). The four query vectors differ from each other and have different distributions of response peaks, which demonstrates the diversity of these input-specific query vectors.

Then, we visualize some qualitative examples of the proposed VLT in Figure 16. To demonstrate the identifying ability of our VLT, we show the mask predictions for two different input language expressions for every example. Images (a) and (c) are two typical examples in which the language expression directly provides the location or color clues of the target object. In the second expression of image (c), “lighter color cat”, it can be seen that the proposed VLT is able to handle expressions that indicate the target object by comparing it with other objects, e.g., “lighter”. The examples of images (b) and (d) demonstrate the model’s ability to understand attribute words, e.g., “stripes”, and relatively rare words, e.g., “floral”. In the second expression of image (e), our VLT successfully identifies the target object referred to by an expression describing the relationships between objects, i.e., “Elephant with rider”. Image (f) contains a group of people, where all instances are densely distributed in a complicated layout. The proposed method manages to identify the target instance with difficult language expressions that contain multiple kinds of clues, such as direction (“9 o’clock”), attributes (“white coat” & “gray suit”), and posture (“kneeling”).

6 Results on Referring Video Object Segmentation

Our proposed approach can also be applied to the referring video object segmentation (RVOS) task with minor adaptation. We apply our model to each individual frame of the input video clip. We use the average vision features of all frames of a video clip as the vision features in the QGM ($F_{vq}$ in Figure 6). This keeps the query input identical for all frames in a video clip, achieving temporal consistency across frames. When performing the contrastive learning on video data, we sample different objects $S_{DO}$ in the $\pm 2$ adjacent frames of the initial object. As adjacent frames share a similar image structure with the original frame, we can enlarge the number of negative samples while keeping a behavior similar to that of our image model. According to the experiments, when only sampling $S_{DO}$ in the same video frame, the $\mathcal{J}\&\mathcal{F}$ performance is 63.5, while it increases to 63.8 when adding the $\pm 2$ adjacent frames.
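The two video-specific adaptations can be illustrated with the sketch below. It assumes that each frame's vision feature is flattened to a plain vector and that frames near the start or end of a clip simply have fewer neighbours; both function names are hypothetical.

```python
def average_frame_features(frame_feats):
    """Element-wise mean over per-frame feature vectors.

    The result plays the role of the shared vision feature fed to the QGM,
    so every frame of the clip receives the same query input.
    """
    n = len(frame_feats)
    return [sum(values) / n for values in zip(*frame_feats)]


def adjacent_frame_indices(t, num_frames, radius=2):
    """Indices of frame t and its +/- radius neighbours, clipped to the clip.

    SIDO negatives for the initial object at frame t are drawn from these
    frames instead of from frame t alone.
    """
    return [i for i in range(t - radius, t + radius + 1) if 0 <= i < num_frames]
```

With `radius=0` the second helper reduces to sampling negatives from the same frame only, which corresponds to the 63.5 setting reported above.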

In TABLE VIII, we report the quantitative results of the proposed VLT on the validation sets of the YouTube-RVOS and Ref-DAVIS17 datasets. YouTube-RVOS is a large-scale referring video object segmentation benchmark, containing 3,978 video clips with around 15K language expressions. Ref-DAVIS17, built on DAVIS17, contains 90 video clips. The results are reported with three standard evaluation metrics: region similarity $\mathcal{J}$, contour accuracy $\mathcal{F}$, and the mean of the two metrics $\mathcal{J\&F}=(\mathcal{J}+\mathcal{F})/2$.

To ensure a fair comparison, we use the Base model of Video Swin Transformer (V-Swin-B) as the backbone, the same as ReferFormer. “Ensemble” denotes a visual encoder ensemble of three backbones, including ResNet101, HRNet, and ResNeSt101. As shown in TABLE VIII, although we do not design specific modules and training losses for RVOS as in ReferFormer, the proposed VLT consistently achieves new state-of-the-art RVOS results on both YouTube-RVOS and Ref-DAVIS17, which demonstrates the effectiveness of the proposed VLT on referring video object segmentation.

Conclusion

In this work, we address the challenging multi-modal task of referring segmentation by introducing a transformer to facilitate the long-range information exchange that is difficult to achieve in conventional convolutional networks. We reformulate referring segmentation as a direct attention problem and propose a Vision-Language Transformer (VLT) framework that exploits the transformer to perform attention operations. To emphasize the differences among pixels/objects, we introduce a spatial-dynamic multi-modal fusion that produces a specific language feature vector for each position of the image feature according to the interaction between the language information and the corresponding pixel information. To address the ambiguity of referring expressions caused by their unknown emphasis, we propose a Query Generation Module and a Query Balance Module to better comprehend the referring sentence with the help of the referred image information. These two modules work together to markedly improve the diversity of ways to understand the image and query language. We further consider inter-sample learning to explicitly endow the model with knowledge of the different language expressions that can refer to one object. Masked contrastive representation learning is proposed to narrow the distance between features of different expressions for the same target object while distinguishing the features of different objects, which significantly enhances the model’s ability to deal with diverse language expressions in the wild. The proposed model is lightweight and achieves new state-of-the-art performance on three public referring image segmentation datasets and two referring video object segmentation datasets.