Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation

Chang Liu, Henghui Ding, Yulun Zhang, Xudong Jiang

I Introduction

Referring image segmentation aims at generating a mask for the object referred to by a given language expression in the input image. Since it was proposed in 2016, this problem has been widely studied, yet many issues remain to be addressed. One of the biggest challenges is that this task requires reasoning over multiple types of information, such as vision and language, while the unconstrained expression of natural language and the diversity of objects in scene images bring huge uncertainty to the understanding and fusion of multi-modal features.

Recently, attention-based networks have become an attractive framework for building vision models. Originally introduced for Natural Language Processing (NLP) tasks, the Transformer is naturally suitable for solving multi-modal tasks, especially CV-NLP tasks like referring segmentation. Most previous works utilize the generic attention mechanism to model the relationship between language and vision information. The generic attention mechanism highlights the most relevant image region for each word in the language input, as shown in the right part of Figure 1 (a). By aggregating the input vision features according to the generated attention weights, as shown in the blue path in Figure 1 (b), the derived feature describes each word using a combination of vision features. As the language feature is only used for calculating the attention weights, we call the result the language-attended vision feature (LAV).

The aforementioned attention mechanism is useful for processing vision information. However, since referring segmentation is a multi-modal task, the language information is also essential. Thus, for processing the language information, a natural way is to introduce another type of attention that outputs language features. For each pixel in the image, we can find the words that are most relevant to it, as shown in the left part of Figure 1 (a). By aggregating the features of these words according to the attention weights, a set of vision-attended language features (VAL) can be derived for each image pixel. In contrast to LAV, which is a set of vision features, VAL describes each pixel using language features. However, both VAL and LAV have limitations: they are both essentially single-modal features and only represent a part of the multi-modal information. For example, VAL is a set of language features for describing pixels, but the inherent vision feature of each pixel itself is not preserved. We argue that a more holistic understanding of multi-modal information can be obtained by fusing the features of the two modalities together. However, this is not achievable with the generic single-modal attention mechanism.

Overall, the contributions of this work can be summarized as follows:

The proposed approach achieves new state-of-the-art referring image segmentation performance on RefCOCO series datasets consistently.

II Related Works

In this section, we discuss methods that are closely related to this work, including referring segmentation and transformer.

Referring segmentation is inspired by the task of referring comprehension. Different from semantic segmentation based on pre-defined categories, referring segmentation predicts a segmentation mask according to a given language expression. Hu et al. define the task and introduce the classic one-stage method for referring segmentation. They first extract features from the image and the language expression respectively, then fuse them together and apply an FCN (Fully Convolutional Network) to the fused feature. Liu et al. propose a recurrent model that utilizes word features in the sentence. Other works use the language feature to generate a set of filter kernels, then apply them to the image feature. Later, Yu et al. propose to add the word features to derive the attention weights in the later stage of the network, after the tile-and-concatenate preliminary fusion is done. They design an attention module that utilizes the word features on the multi-modal feature, achieving remarkable performance. With a similar pipeline, Hu et al. propose a bi-directional attention module to further utilize the features of words. Hui et al. propose to analyse the linguistic structure for better language understanding. Yang et al. use explainable reasoning. Luo et al. propose a novel pipeline that merges referring segmentation and referring comprehension together, but in terms of language feature fusion, it still uses a multi-modal fusing technique similar to earlier works. Some other works propose special language usages. For example, Yu et al. adopt a two-stage pipeline like referring comprehension methods. Ding et al. use the language feature to generate query vectors for a transformer-based network. Feng et al. propose to utilize the language feature earlier, in the encoder stage. Kamath et al. use transformer-based backbones for processing language inputs. Most recently, CRIS proposes to use the large multi-modal model CLIP to address the referring segmentation task. Yang et al. and Kim et al. design more advanced transformer architectures and achieve impressive performance.

However, a common point of most previous works is that their language information is injected into the multi-modal feature only at certain “steps”. For example, in the earlier works there is only a one-time fusion, and all subsequent operations are applied on the fused features. In most recent works, the language information is used twice: once for the tile-and-concatenate preliminary fusion and once as auxiliary information, such as the input of an attention module. Instead, the language information in our network is iteratively utilized through the whole prediction process, establishing an in-depth interaction between features from the two modalities. Besides, most previous networks are unaware of whether the language information is lost during propagation through the network. Our network ensures that the language information is kept until the rear stage of the network, encouraging it to fully interact with information from the other modality.

II-B Transformer

The Transformer was first introduced by Vaswani et al. for Natural Language Processing (NLP) tasks. It quickly became a popular sequence-to-sequence model in the NLP area. Thanks to its strong global relationship modeling ability, it was recently migrated into the Computer Vision (CV) area and has achieved good performance in many tasks, such as image classification, deblurring, object detection, semantic segmentation, instance segmentation, and video segmentation. Its good performance in various areas also suggests its potential in handling multi-modal information, and there are several works on multi-modal transformers. For example, Radford et al. design a network that uses natural language to supervise a vision model. Kim et al. propose a large-scale pretrained model for vision-language tasks. However, most of the relevant works are built upon generic transformers that were originally designed for a single modality, e.g., language or vision. These methods are not optimized for processing multi-modal information, so they lack some functions needed for multi-modal features, for example, feature fusion. In this work, we propose a Mutual Attention mechanism, which is designed for multi-modal features. It accepts inputs from multiple modalities, enables them to interact with each other, and densely fuses them together, so as to output a true multi-modal feature.

III Methodology

The overall architecture of our proposed approach is shown in Figure 2. The network’s inputs include an image $I$ and a language expression $T$ containing $N_{t}$ words. Following previous works, we first extract two sets of input backbone features: the image feature $F_{vis}$ from $I$ using a CNN backbone, and the language features $F_{t}$ and $F_{t}^{\prime}$ from $T$ using a bi-directional LSTM. The image feature $F_{vis}$ has the shape of $H\times W\times C$, where $H$ and $W$ denote height and width respectively, and $C$ is the number of channels. For the language feature, the hidden states of the LSTM $F_{t}\in R^{N_{t}\times C}$ represent the feature for each word, while the final state output $F_{t}^{\prime}$ is used as the representation of the whole sentence. The channel number of the language features is also $C$ for ease of fusion.

As mentioned above, most previous works use the generic attention mechanism for processing multi-modal information. Figure 3 (a) gives an example of this kind of mechanism. Features from two modalities (query and key) are used to derive an attention matrix, which is then used to aggregate the vision feature for each word. In this process, the language feature is only used to generate the attention weights that indicate the significance of regions in the vision feature. Hence, language information is not directly involved in the output, so the output can be viewed as a reorganized single-modal vision feature. Even worse, this single-modal vision output is used alone as the query in the successive transformer decoder, dominating the information in the decoder. As a result, language information is dramatically lost in the decoder. Thus we argue that the generic attention mechanism is good at processing features from the value input, but lacks the ability to fuse features from two modalities. So, if it is used to process multi-modal information, the query input is not fully utilized, and the features of the two modalities do not densely fuse and interact.

$A_{mut}$ is a multi-modal attention matrix with shape $N_{t}\times HW$, describing the relationship strength from all elements of one modality to all elements of the other, and $\frac{1}{\sqrt{C}}$ is the scaling factor. Then, unlike the generic attention that applies the attention matrix to only one modality, we normalize the mutual attention matrix along both axes and apply it to features from both modalities:

Language-attended vision feature (LAV), $F_{V}^{a}$: Softmax normalization is applied along each $HW\times 1$ axis of the mutual attention matrix $A_{mut}$, as in Eq. (2a), and the result is then applied to the vision feature $F_{V}^{v}$ to get the language-attended vision feature $F_{V}^{a}\in R^{N_{t}\times C}$. There are $N_{t}$ feature vectors in $F_{V}^{a}$, where each vector represents one attended vision feature corresponding to one element (word) in the language input. In other words, each vector is the vision feature weighted by a word based on its interpretation of the image. It is similar to the output of the generic attention mechanism. As the language features only participate in the attention matrix, the output is essentially still a single-modal feature.

Vision-attended language feature (VAL), $F_{L}^{a}$: Another softmax normalization is applied on the transposed mutual attention matrix, ${A_{mut}}^{T}$, along the $N_{t}\times 1$ axis, as in Eq. (2b). By applying the attention matrix to the language feature $F_{L}^{v}$, we get the vision-attended language feature $F_{L}^{a}\in R^{HW\times C}$. $F_{L}^{a}$ contains $HW$ feature vectors, where each vector represents one attended language feature corresponding to one pixel in the vision feature. In other words, $F_{L}^{a}$ is a spatial-dynamic language feature: each vector of $F_{L}^{a}$ is the language feature weighted based on a pixel’s interpretation of the sentence.
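The two normalizations described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the learned key and value projections are omitted (the same features serve as keys and values), the function names are hypothetical, and the toy features are made up purely to show the shapes.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def matmul(a, b):
    """Plain (rows x inner) @ (inner x cols) matrix product."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def mutual_attention(f_lang, f_vis):
    """f_lang: N_t x C word features; f_vis: HW x C flattened pixel features.

    Assumption: no learned projections, so keys and values are the raw features.
    """
    c = len(f_lang[0])
    scale = 1.0 / math.sqrt(c)
    # Mutual attention matrix A_mut with shape N_t x HW.
    a_mut = [[scale * s for s in row] for row in matmul(f_lang, transpose(f_vis))]
    # LAV: softmax over the HW pixels for each word, then aggregate vision features.
    lav = matmul([softmax(row) for row in a_mut], f_vis)               # N_t x C
    # VAL: softmax over the N_t words for each pixel, then aggregate language features.
    val = matmul([softmax(row) for row in transpose(a_mut)], f_lang)   # HW x C
    return lav, val

# Toy input: N_t = 2 words, HW = 3 pixels, C = 2 channels.
f_lang = [[1.0, 0.0], [0.0, 1.0]]
f_vis = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
lav, val = mutual_attention(f_lang, f_vis)
```

In this sketch `lav` has one row per word and is built only from vision features, while `val` has one row per pixel and is built only from language features, which mirrors the observation that each output on its own is still single-modal.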

III-B Iterative Multi-Modal Interaction

Besides, as features are propagated to higher layers, the model’s understanding of language information becomes deeper. As a result, different layers focus on different types of information. For example, features from lower layers do not have a contextual understanding of the relationship between language and image, so they need more specific clues, while features from higher layers need more holistic information as they already have a better understanding of image and language. Therefore, it is desirable to transform the language features along with the processing of the multi-modal feature.

Next, we project the multi-modal feature using a linear layer, and compute an attention matrix for reorganizing the language features:
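As a rough illustration of this step, the sketch below projects a multi-modal feature with a linear map, scores it against the word features, and uses the normalized scores to re-weight the language features. The names `reorganize_language`, `f_mm` and `w_re`, the dot-product scoring, the $\frac{1}{\sqrt{C}}$ scaling and the row-wise softmax are all assumptions made for illustration, not details confirmed by the text.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain (rows x inner) @ (inner x cols) matrix product."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def reorganize_language(f_mm, f_t, w_re):
    """f_mm: M x C multi-modal feature; f_t: N_t x C word features; w_re: C x C.

    Hypothetical helper: returns an M x C language feature re-weighted per row of f_mm.
    """
    c = len(f_t[0])
    projected = matmul(f_mm, w_re)                      # linear projection, M x C
    f_t_transposed = [list(col) for col in zip(*f_t)]   # C x N_t
    scores = matmul(projected, f_t_transposed)          # M x N_t
    # Assumed normalization: scaled softmax over the N_t words.
    attn = [softmax([s / math.sqrt(c) for s in row]) for row in scores]
    return matmul(attn, f_t)                            # M x C

# Toy input: M = 2, N_t = 3, C = 2, identity projection.
f_mm = [[1.0, 0.0], [0.0, 1.0]]
f_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_re = [[1.0, 0.0], [0.0, 1.0]]
f_t_new = reorganize_language(f_mm, f_t, w_re)
```

Because the weights depend on the current multi-modal feature, each layer would receive a differently weighted view of the same words, which is the behaviour motivated in the preceding paragraph.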

III-C Language Feature Reconstruction

In most referring segmentation methods, the network is only supervised by the output mask loss. This implies a hypothesis: as long as the output mask matches the target object, we consider that the model has successfully understood the language information. However, this is not always true in real-world scenarios. For example, it is assumed in most datasets that there is always one and only one object in the ground-truth segmentation mask for each training sample. The network can easily learn this kind of data bias and always output one object. Therefore, if the network happens to “guess” the correct target for some training samples even though the language information has been lost during propagation, these samples may not properly contribute to the training of the network, or may even harm its ability to generalize.

Here, $\copyright$ denotes concatenation. Both $\copyright$ and $\sum$ are conducted along the sequence length dimension (i.e., $[(F_{t}+e)\ \text{\footnotesize\copyright}\ F_{t}^{\prime}]\in R^{(N_{t}+1)\times C}$). $e$ denotes the cosine positional embedding, which adds information about the order of words in the sentence. $W_{proj}\in R^{C\times C}$ is a learnable projection matrix, and $N_{t}$, the length of the sentence, is used for normalization.
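The symbols above suggest a pooled and projected sentence-level target. A minimal sketch of one reading follows, assuming the word features plus positional embedding are concatenated with the sentence feature, summed over the sequence dimension, divided by $N_{t}$, and projected by $W_{proj}$. The function name and the exact order of operations are assumptions; since the projection is linear and has no bias here, projecting before or after the sum gives the same result. The positional embedding is passed in as an input rather than computed.

```python
def reconstruction_target(f_t, e, f_t_prime, w_proj):
    """Hypothetical helper for the language feature reconstruction target.

    f_t:       N_t x C word features
    e:         N_t x C positional embedding (given as input, not computed here)
    f_t_prime: length-C sentence feature
    w_proj:    C x C projection matrix
    """
    n_t = len(f_t)
    # (F_t + e) concatenated with F_t' along the sequence dimension: (N_t + 1) x C.
    seq = [[a + b for a, b in zip(row, pos)] for row, pos in zip(f_t, e)]
    seq.append(list(f_t_prime))
    # Sum along the sequence dimension, normalized by the sentence length N_t.
    pooled = [sum(col) / n_t for col in zip(*seq)]
    # Linear projection: row vector (length C) times W_proj (C x C).
    return [sum(p * w for p, w in zip(pooled, col)) for col in zip(*w_proj)]

# Toy input: N_t = 2 words, C = 2 channels, zero positional embedding, identity projection.
target = reconstruction_target(
    f_t=[[1.0, 2.0], [3.0, 4.0]],
    e=[[0.0, 0.0], [0.0, 0.0]],
    f_t_prime=[2.0, 2.0],
    w_proj=[[1.0, 0.0], [0.0, 1.0]],
)
```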

III-D Mask Decoder and Loss Function

The output mask is supervised by the Binary Cross Entropy loss. The final loss function is defined as:

$$\mathcal{L}=w_{mask}\mathcal{L}_{mask}+w_{rec}\mathcal{L}_{rec},$$

where $w_{mask}$ is the weight for the mask loss $\mathcal{L}_{mask}$ and $w_{rec}$ is the weight for the Language Feature Reconstruction loss $\mathcal{L}_{rec}$. The proposed LFR does not directly participate in the mask prediction and is computationally free during inference. It can work as a plug-in module for any existing referring segmentation method.
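Assuming the final loss is the weighted sum of the two terms, as the two weights suggest, the combination can be sketched as follows. The helper names are hypothetical, the default weights are the values reported in the experimental settings, and the reconstruction loss is passed in as a number because its exact form is not specified here.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean BCE over flattened mask probabilities and 0/1 ground-truth labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def total_loss(l_mask, l_rec, w_mask=1.0, w_rec=0.1):
    """Weighted sum of the mask loss and the reconstruction loss (assumed form)."""
    return w_mask * l_mask + w_rec * l_rec

# Toy example: four pixels of a predicted mask and a placeholder reconstruction loss.
l_mask = binary_cross_entropy([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
loss = total_loss(l_mask, l_rec=0.5)
```

Only the mask branch is needed at inference time, so dropping the second term costs nothing when the model is deployed.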

IV Experiments

In this section, we report the experimental results of our method in comparison with previous state-of-the-art methods, and the ablation studies that verify the effectiveness of our proposed modules. We evaluate the performance by two commonly used metrics: the IoU score, which measures the Intersection over Union between the model’s output mask and the ground-truth mask, and the Precision@X score, which is built on IoU. Given a threshold X, the Precision@X score computes the percentage of predictions whose IoU scores are higher than X.
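Both metrics can be sketched in a few lines of plain Python. The function names are hypothetical, masks are represented as flat lists of 0/1 values, and the handling of two empty masks is an assumption; how IoU is aggregated over a dataset is not specified here, so only the per-sample score is shown.

```python
def iou(pred, gt):
    """Intersection over Union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0

def precision_at(ious, threshold):
    """Fraction of samples whose IoU is strictly higher than the threshold."""
    return sum(1 for v in ious if v > threshold) / len(ious)

# Toy example: one pixel overlaps, three pixels are covered by either mask.
sample_iou = iou([1, 1, 0, 0], [1, 0, 1, 0])
pr_at_half = precision_at([0.9, 0.6, 0.4], 0.5)
```

Multiplying the result of `precision_at` by 100 gives the percentage form used for Pr@0.5 and the other thresholds.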

We train and evaluate the proposed approach on three commonly used referring image segmentation datasets: RefCOCO, RefCOCO+, and RefCOCOg. Following previous works, the image features are extracted by a Darknet-53 backbone pretrained on the MSCOCO dataset, and language embeddings are generated by GloVe. Language expressions are padded to 15 words for RefCOCO/RefCOCO+ and 20 words for RefCOCOg. Images from the validation set and test set of the referring segmentation datasets are excluded when training the backbone. Following previous works, images are resized to 416 $\times$ 416 for the CNN backbone and 480 $\times$ 480 for the Transformer backbone. Channel number $C$ is fixed to 256 for the transformers and 512 for the mask decoder. The network has 2 encoder layers. The head number is 8 for all transformer layers. The weight for the mask loss $w_{mask}$ is set to 1 and the weight for the Language Feature Reconstruction loss $w_{rec}$ is set to $0.1$. All linear layers and convolution layers are followed by a Batch Normalization and ReLU function unless otherwise noted. The network is trained for 50 epochs with the batch size set to 48, using the Adam optimizer. The learning rate is set to 0.005 with a step decay schedule. We use 4 NVIDIA V100 GPUs for training and testing.
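For reference, the settings listed above can be collected in a small configuration object. This is only an illustrative sketch: the class, field and function names are hypothetical, the step decay schedule is left out because its milestones are not stated, and truncating expressions longer than the limit is an assumption (the text only mentions padding).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    dataset: str = "RefCOCO"          # "RefCOCO", "RefCOCO+" or "RefCOCOg"
    backbone: str = "cnn"             # "cnn" (Darknet-53) or "transformer"
    channels_transformer: int = 256
    channels_mask_decoder: int = 512
    encoder_layers: int = 2
    attention_heads: int = 8
    w_mask: float = 1.0
    w_rec: float = 0.1
    epochs: int = 50
    batch_size: int = 48
    learning_rate: float = 0.005

    @property
    def max_words(self) -> int:
        """Expression length after padding: 20 for RefCOCOg, 15 otherwise."""
        return 20 if self.dataset == "RefCOCOg" else 15

    @property
    def image_size(self) -> int:
        """Square input size: 416 for the CNN backbone, 480 for the Transformer one."""
        return 416 if self.backbone == "cnn" else 480

def pad_expression(tokens, max_words, pad_token="<pad>"):
    """Pad a token list to max_words; longer inputs are truncated (assumption)."""
    return (list(tokens) + [pad_token] * max_words)[:max_words]

cfg = TrainConfig(dataset="RefCOCOg", backbone="transformer")
padded = pad_expression(["man", "in", "red", "sitting"], cfg.max_words)
```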

IV-B Ablation Study

We conduct several ablation experiments to show the effectiveness of each proposed module in our framework. The results are reported in TABLE I and TABLE II.

IV-C Visualizations

In this section, we visualize some sample outputs of our model in Figure 8. To show the superior language understanding performance of our method, we use images and language expressions from the RefCOCOg dataset, whose language expressions are more natural and complex than those of the other datasets. All examples in Figure 8 have long sentences with more than 10 words, and with more than two instances appearing in the text. Example (a) has a difficult sentence and a complicated layout where three mattresses are crowded in a small room. Our model has a good contextual understanding of the key words “mattress”, “pink and yellow”, “blue”, and their relations, and is not distracted by other mattresses and blue objects. Example (b) has a very long sentence, but most of the information is not discriminative for identifying the target, e.g., both people in the image have short hair and are looking to the side. Our model detects the informative part of the sentence and targets the right object. Example (c) shows that our model can not only identify foreground objects but is also able to detect targets in the background. In the language expression of example (d), three objects are mentioned: “a woman”, “a cake”, and “a man”. Our model still manages to find the subject of the difficult sentence and target the instance in the image.

Besides, in Figure 9, we show extra examples of using multiple language expressions to refer to different objects in one image. In example (a), it can be seen that our method successfully handles complex relationships and attributes such as “has been almost half eaten” and “with cherry on it”. In example (b), our method can retrieve the correct object from a complex scene. The first expression says “standing lady” while there are two ladies in the image, and our method finds the correct one. The second expression says “man in red sitting”. There are three pieces of information in this expression: “man”, “in red”, and “sitting”. From the image we can see that all three points are necessary to find the target, i.e., the target cannot be determined if any one of them is missing. The network has to understand and combine all the information in the expression. This example demonstrates the strong performance of our network in establishing the pixel-language correspondence.

IV-D Comparison with State-of-the-Art Methods

We report the experimental results of our method on three datasets, RefCOCO, RefCOCO+, and RefCOCOg, to compare with previous state-of-the-art methods in TABLE IV. There are two data splitting types for the RefCOCOg dataset. One is referred to as the UMD split and the other is the Google split. The UMD split has both a validation set and a test set available, while the Google split only has a validation set publicly available. We conduct experiments and report the results on both splits. From TABLE IV, it can be seen that our method achieves superior performance on all datasets and outperforms previous state-of-the-art methods. On the RefCOCO dataset, our method is $1.5\%-2\%$ better than the previous SOTA methods, including VLT and LTS. On the other two datasets, our method also has a consistent improvement of about $1.5\%$ compared with the previous state-of-the-art methods. Besides, for a fair comparison, we also implement our model with the stronger Swin-Transformer backbone. It can be seen that our model with the Swin-Transformer backbone also achieves a significant improvement of around $1\%$ across most of the datasets. Especially for RefCOCO+, our model with the Swin-Transformer backbone achieves about $2\%$ improvement over the previous SOTA method VLT+. This shows that our model is robust to different backbones and can achieve better performance with stronger backbones.

We also compare the Precision@X scores on the RefCOCO validation set against other methods that have data available, and the results are shown in TABLE V. From the Pr@0.5 row, it can be seen that our model achieves the highest score. Compared with VLT, which also utilizes a transformer model as the prediction head, our method is over $2\%$ higher in terms of Pr@0.5. The previous state-of-the-art method on the Pr@0.5 metric, MCN, utilizes data from both referring segmentation datasets (segmentation masks) and referring comprehension datasets (bounding boxes) in training to better locate the target, while our model only uses the segmentation mask as ground truth. Nevertheless, our method achieves a better Pr@0.5 score by a large margin of 2.41%. We attribute this to the better understanding of the language expression and the denser interaction between the features from the two modalities. This shows that our proposed modules leverage the information in the given language expression more effectively and better fuse it with the vision information.

IV-E Failure Cases

We examine two typical categories of failure cases. (1) Instances where the input expression refers to uncommon or unexpected areas. For instance, in example (a), the expression asks us to locate a “gap between newspaper and sandwich,” which, in reality, is a part of the table. Such expressions are atypical and not commonly observed in practical situations. (2) Instances where the expression is ambiguous or seeks an excessive amount of detail. In example (b), the expression is “man using oven”. From the picture, it appears that both men are operating machines in the kitchen, and both machines resemble an oven. As a result, our model highlights both individuals. Nonetheless, if we look very carefully, the machine on top could also be a microwave. In such cases, the expression is rather ambiguous, and the model is unable to handle it. Dealing with such situations could be an interesting topic for future research.