ReSTR: Convolution-free Referring Image Segmentation Using Transformers

Namyup Kim, Dongwon Kim, Cuiling Lan, Wenjun Zeng, Suha Kwak

Introduction

Recent years have witnessed remarkable advances in semantic segmentation in terms of both efficacy and efficiency. However, its application to real-world downstream tasks is still limited. Since the task is designed to deal with only a predefined set of classes (e.g., “car”, “person”), semantic segmentation models struggle to handle undefined classes and specific entities of user interest (e.g., “a red Ferrari”, “a man wearing a blue hat”).

Referring image segmentation has been studied to resolve this limitation by segmenting an image region corresponding to a natural language expression given as a query. As this task is no longer restricted by predefined classes, it enables a large variety of applications such as human-robot interaction and interactive photo editing. Referring image segmentation is, however, more challenging than semantic segmentation since it requires comprehending individual entities and their relations expressed in the language expression (e.g., “a car behind the taxi next to the building”), and fully exploiting such structured and relational information in the segmentation process. For this reason, models for the task should be capable of capturing interactions between semantic entities in both modalities as well as joint reasoning over the two different modalities.

Existing methods for referring image segmentation have adopted convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract visual and linguistic features, respectively. In general, these features are integrated into a multimodal feature map through convolution layers applied to a concatenation of the two features, the so-called concatenation-convolution operation. On top of the multimodal feature map, recent methods further employ attention mechanisms so that the feature map effectively captures interactions between semantic entities. The final multimodal features are then fed as input to a segmentation module.

Although these methods have shown remarkable results on this challenging task, they share the following limitations. First, they have trouble handling long-range interactions between semantic entities within each modality. Referring image segmentation requires capturing such interactions since language expressions often involve complicated relations between entities to precisely indicate the target region. In this respect, both CNNs and RNNs are limited due to the locality of their basic building blocks. Second, existing models have difficulty in modeling sophisticated interactions between the two modalities. They aggregate visual and linguistic features through the concatenation-convolution operation, which is a fixed and handcrafted way of feature fusion and thus may not be sufficiently flexible and effective to handle a large variety of referring image segmentation scenarios.

To overcome the aforementioned limitations, we propose the first convolution-free model for Referring image Segmentation using TRansformers, dubbed ReSTR. Its overall pipeline is illustrated briefly in Fig. 1. First, ReSTR extracts visual and linguistic features through transformer encoders. The two encoders, namely the vision encoder and the language encoder, take a set of non-overlapping image patches and a set of word embeddings as input, respectively, and extract their features while considering long-range interactions within each modality. By using transformers for both modalities, we benefit from capturing global context from the beginning of feature extraction and from unifying the network topology for the two modalities.

Next, a self-attention encoder aggregates the visual and linguistic features into patch-wise multimodal features. This multimodal fusion encoder enables sophisticated and flexible interactions between features of the two modalities thanks to its self-attention layers. Moreover, the fusion encoder takes a class seed embedding as another input. The class seed embedding is transformed adaptively by the fusion encoder into a classifier for the target entity described in the language expression.

Finally, the outputs of the multimodal fusion encoder, i.e., the patch-wise multimodal features and the adaptive classifier, are fed as input to the segmentation decoder. The decoder computes the final segmentation map in a coarse-to-fine manner. The adaptive classifier is first applied to each multimodal feature to examine whether each image patch contains a part of the target entity. The coarse, patch-level prediction is then converted into a pixel-level segmentation map by a series of upsampling and linear layers. Thanks to the powerful transformer encoders, this simple and efficient decoder is able to produce accurate segmentation results, achieving the state of the art on four public benchmarks for referring image segmentation.

In summary, the contribution of this work is three-fold:

Our network is the first convolution-free architecture for referring image segmentation. It captures long-range interactions between vision and language modalities and unifies the network topology for the two different modalities by transformers.

To encode the fine comprehension of the two modalities, we carefully design the multimodal fusion encoder with the class seed embedding which is transformed to an adaptive classifier for referring image segmentation.

ReSTR achieves the state of the art on four public benchmarks without bells and whistles.

Related Work

Semantic segmentation has been significantly improved with the emergence of deep neural networks. Building on the Fully Convolutional Network (FCN), which performs pixel-level prediction in an end-to-end framework, many approaches have been proposed to overcome its limitations. Since FCN predicts a coarse output mask, early approaches focus on high-resolution prediction. These studies extend the receptive field of CNNs with dilated convolutions and capture multiscale context with a feature pyramid pooling scheme. Several approaches propose encoder-decoder structures that realize a coarse-to-fine framework through multi-level feature fusion. More recently, attention mechanisms have been used to capture contextual information for semantic segmentation.

However, the methods above use variants of the FCN architecture, whose convolutional layers limit context encoding to local neighborhoods. Moreover, since this task is defined to predict segmentation masks within only a predefined set of classes, semantic segmentation models are of limited use in real applications.

2.2 Referring Image Segmentation

In contrast to semantic segmentation, which classifies pixels into predefined classes, referring image segmentation aims at grouping pixels into a mask corresponding to a given natural language expression. The pioneering work extracts visual and linguistic features with a CNN and an RNN, respectively, and generates multimodal features by concatenating tiled linguistic features with visual feature maps. Based on this framework, early approaches perform high-resolution prediction with ConvLSTM and with encoder-decoder architectures that use intermediate connections. Follow-up studies propose attention mechanisms to fuse the visual and linguistic features and multi-level feature aggregation to produce high-resolution segmentation maps. Recent studies suggest a multimodal fusion encoder using transformers to capture long-range interactions between visual and linguistic features.

Unlike the existing work, we propose a new convolution-free architecture to encode contextual information at every stage of our model and efficiently transform a patch-level prediction to a high-resolution segmentation map in a coarse-to-fine manner.

2.3 Vision Transformer

Since the introduction of transformers as a self-attention module for NLP, many approaches have adopted them in computer vision tasks for their advantages, including modeling of long-range dependencies, input-dependent dynamic kernels, and weaker visual inductive bias than CNNs. Several studies employ transformers as an attention module within or on top of CNNs, forming a CNN-transformer hybrid network. Recent approaches replace CNNs with transformers to build convolution-free architectures for image classification, object detection, semantic segmentation, and multimodal learning. In particular, transformers are deployed in semantic segmentation to overcome the inherent limitation of FCN-like architectures. For example, Zheng et al. utilize a transformer backbone as a global context feature extractor followed by convolutional layers as a decoder, in a hybrid manner. Strudel et al. propose a convolution-free architecture for semantic segmentation based on self-attention over visual features and a set of learnable class embeddings. Inspired by this paradigm, we adopt transformers for referring image segmentation for the above advantages and use an adaptive classifier as an extension of the learnable class queries used in semantic segmentation transformers.

Proposed Method

This section elaborates on ReSTR, our convolution-free transformer network for referring image segmentation. Its detailed architecture is illustrated in Fig. 2. To capture long-range interactions within each modality, ReSTR first extracts visual and linguistic features by transformer encoders independently (Sec. 3.1). Then, it forwards the visual and linguistic features in parallel to a multimodal fusion encoder to capture fine relations across the two modalities (Sec. 3.2). Finally, an efficient decoder for coarse-to-fine segmentation converts the patch-level prediction into a high-resolution pixel-level prediction (Sec. 3.3).

To extract visual and linguistic features, we choose transformers for both modalities. A transformer encoder is $M$ sequential transformers, each of which consists of Multi-headed Self-Attention (MSA), Layer Normalization (LN), and Multilayer Perceptron (MLP) blocks:

where $\boldsymbol{\theta}_{v}$ are the parameters of the vision encoder.
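For reference, a transformer layer built from MSA, LN, and MLP blocks is commonly written in the pre-norm residual form used by ViT. The sketch below assumes that form, which the text here does not confirm; the symbols $\mathbf{z}_{i}$ (the token sequence after the $i$-th layer) and $\bar{\mathbf{z}}_{i}$ (an intermediate result) are introduced only for illustration.

```latex
\begin{aligned}
\bar{\mathbf{z}}_{i} &= \mathrm{MSA}\big(\mathrm{LN}(\mathbf{z}_{i-1})\big) + \mathbf{z}_{i-1}, \\
\mathbf{z}_{i} &= \mathrm{MLP}\big(\mathrm{LN}(\bar{\mathbf{z}}_{i})\big) + \bar{\mathbf{z}}_{i}, \qquad i = 1, \dots, M.
\end{aligned}
```

Under this assumption, the output of the encoder is $\mathbf{z}_{M}$, the token sequence after the last of the $M$ layers.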

3.2 Multimodal Fusion Encoder

Then, we feed the class seed embedding $\mathbf{e}_{s}$ and the visual-attended linguistic features $\mathbf{z}_{l}^{\prime}$ into the linguistic-seed encoder:

The multimodal fusion encoder is designed to produce an adaptive classifier that satisfies the following two requirements of referring image segmentation. First, since referring image segmentation aims to segment a region corresponding to a language expression, the adaptive classifier should comprehend fine relations within the language expression. Second, since an input image has regions irrelevant to the language expression (e.g., background), letting the class seed embedding attend directly to the visual information can lead to an adaptive classifier corrupted by the irrelevant regions. Nevertheless, since the appearance of the target entity described in a language expression can differ across images, it is beneficial to produce the adaptive classifier using the visual-attended linguistic features.

Therefore, we build the multimodal fusion encoder using these two transformer encoders alternatively to generate the adaptive classifier that meets the aforementioned conditions. We empirically verify the superiority of our multimodal fusion encoder in Sec. 4.3.
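The data flow of this fusion scheme can be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual implementation: the names `fuse` and `mean_mixing_encoder` are hypothetical, the two encoders are assumed to be applied once in sequence, and a simple mean-mixing function stands in for real transformer layers so that the example runs without dependencies. What it shows is that the class seed embedding only ever shares an encoder with the linguistic tokens, so visual information reaches it indirectly.

```python
from typing import Callable, List, Tuple

Token = List[float]
Encoder = Callable[[List[Token]], List[Token]]


def mean_mixing_encoder(tokens: List[Token]) -> List[Token]:
    """Toy stand-in for a transformer encoder (assumption, not the real model).

    Every token receives the mean of all tokens in the sequence, which mimics
    the fact that self-attention lets all tokens in a sequence interact.
    """
    dim = len(tokens[0])
    mean = [sum(t[k] for t in tokens) / len(tokens) for k in range(dim)]
    return [[t[k] + mean[k] for k in range(dim)] for t in tokens]


def fuse(
    z_v: List[Token],
    z_l: List[Token],
    e_s: Token,
    visual_linguistic_encoder: Encoder,
    linguistic_seed_encoder: Encoder,
) -> Tuple[List[Token], Token]:
    """Return (patch-wise multimodal features, adaptive classifier)."""
    n_v = len(z_v)
    # Stage 1: visual and linguistic tokens interact with each other.
    joint = visual_linguistic_encoder(list(z_v) + list(z_l))
    z_v_out, z_l_out = joint[:n_v], joint[n_v:]
    # Stage 2: the seed interacts only with the visual-attended linguistic
    # tokens; it never sits in the same sequence as the visual tokens.
    seed_sequence = linguistic_seed_encoder(z_l_out + [list(e_s)])
    adaptive_classifier = seed_sequence[-1]
    return z_v_out, adaptive_classifier


patches, classifier = fuse(
    z_v=[[1.0, 0.0], [0.0, 1.0]],  # two visual tokens
    z_l=[[2.0, 2.0]],              # one linguistic token
    e_s=[0.0, 0.0],                # class seed embedding
    visual_linguistic_encoder=mean_mixing_encoder,
    linguistic_seed_encoder=mean_mixing_encoder,
)
print(patches, classifier)
```

In the toy run, the classifier changes when the visual tokens change, but only because the linguistic token it interacts with was updated by them in the first stage.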

3.3 Coarse-to-Fine Segmentation Decoder

where $\sigma$ is the sigmoid function and $\sqrt{D}$ is a normalization factor.
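As a hedged illustration of the patch-level prediction, the sketch below applies the adaptive classifier to each patch-wise multimodal feature as a scaled dot product followed by a sigmoid. The function names and the plain-list representation are assumptions made for this example; only the sigmoid and the $\sqrt{D}$ normalization are taken from the description above.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def patch_level_prediction(
    patch_features: List[List[float]], classifier: List[float]
) -> List[float]:
    """Score each D-dimensional patch feature against the adaptive classifier.

    Returns one probability per patch: sigmoid(<feature, classifier> / sqrt(D)).
    """
    scale = math.sqrt(len(classifier))
    probs = []
    for feat in patch_features:
        logit = sum(f * c for f, c in zip(feat, classifier)) / scale
        probs.append(sigmoid(logit))
    return probs


# Two patches with D = 4: one aligned with the classifier, one orthogonal.
print(patch_level_prediction([[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
                             [2.0, 0.0, 0.0, 0.0]))
```

A patch whose feature is orthogonal to the classifier gets a probability of 0.5, while an aligned patch is pushed toward 1.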

where $\mathbf{y}^{i}_{p}$ denotes the patch-level label of the $i$-th patch $p_{i}$, $j$ is the number of pixels in a patch, $h(\cdot)$ indicates average pooling over the spatial dimensions, and $\tau$ is a thresholding hyperparameter.
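The generation of patch-level labels can be sketched as follows: the binary ground-truth mask is average-pooled over each patch and the result is compared with $\tau$. This is a minimal sketch under assumptions; the function name `patch_labels` is hypothetical, and the use of a strict inequality (rather than greater-or-equal) is an assumption, since the comparison operator is not stated here.

```python
from typing import List


def patch_labels(mask: List[List[int]], patch_size: int, tau: float) -> List[List[int]]:
    """Average-pool a binary mask over non-overlapping patches and threshold.

    A patch is labeled 1 when its foreground fraction exceeds tau
    (strict '>' is an assumption of this sketch).
    """
    height, width = len(mask), len(mask[0])
    if height % patch_size or width % patch_size:
        raise ValueError("mask size must be a multiple of patch_size")
    pixels_per_patch = patch_size * patch_size
    labels = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            foreground = sum(
                mask[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            )
            row.append(1 if foreground / pixels_per_patch > tau else 0)
        labels.append(row)
    return labels


toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(patch_labels(toy_mask, patch_size=2, tau=0.8))
```

With a high threshold, only patches that are almost entirely covered by the target are treated as positive, so thin object boundaries tend to be labeled negative at the patch level.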

The network is trained by the binary cross-entropy loss $\mathcal{L}_{b}(\hat{Y},Y)$ on the patch-level prediction $\mathbf{\hat{y}}_{p}$ and the pixel-level prediction $\hat{Y}_{m}$:

where $\lambda$ is a balancing hyperparameter.
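A hedged sketch of such a two-term objective is given below. It assumes that $\lambda$ weights the patch-level term and that each binary cross-entropy term is averaged over its elements; neither detail is confirmed by the text here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(pred: Sequence[float], target: Sequence[int],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over flattened predictions in (0, 1)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def total_loss(patch_pred: Sequence[float], patch_label: Sequence[int],
               pixel_pred: Sequence[float], pixel_label: Sequence[int],
               lam: float = 0.1) -> float:
    """Combine patch-level and pixel-level losses.

    Assumption of this sketch: lam scales the patch-level term.
    """
    patch_term = binary_cross_entropy(patch_pred, patch_label)
    pixel_term = binary_cross_entropy(pixel_pred, pixel_label)
    return lam * patch_term + pixel_term


print(total_loss([0.9, 0.2], [1, 0], [0.8, 0.7, 0.1, 0.3], [1, 1, 0, 0]))
```

Under this assumption, a small $\lambda$ makes the pixel-level term dominate the gradient, while the patch-level term acts as auxiliary supervision for the adaptive classifier.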

Experiments

Datasets. We conduct experiments on four datasets, ReferIt, UNC, UNC+, and Gref, which are widely used for the referring image segmentation task. ReferIt contains 19,894 images with 130,525 language expressions for 96,654 masks, which are collected from IAPR TC-12. UNC, UNC+, and Gref are collected from the COCO dataset. UNC and UNC+ consist of 19,994 images with 142,209 language expressions for 50,000 masks and 19,992 images with 141,564 language expressions for 49,856 masks, respectively. The difference between UNC and UNC+ is that UNC+ does not contain words that indicate location properties (e.g., left, top, front) in its expressions and contains only appearance expressions. Gref contains 25,711 images with 104,560 language expressions for 54,822 objects.

Implementation details. We use ViT-B-16 pretrained on ImageNet-21K for the vision encoder, which has 12 layers, a patch size of 16, 768 channel dimensions, 12 MSA heads, and 3,072 dimensions of channel expansion in the MLP. We use pretrained GloVe embeddings for language expressions. The language encoder consists of 6 transformer layers and has 300 channel dimensions to match the GloVe embeddings, 12 MSA heads, and 3,072 dimensions of channel expansion in the MLP. The maximum length of a language expression $N_{l}$ is set to 20 following previous work. The multimodal fusion encoder consists of the same type of transformer layers as the vision encoder. The number of layers of the segmentation decoder is 4 since the patch size is 16. In all experiments, the models are optimized by AdamW with a weight decay of $5\times 10^{-4}$; the initial learning rate is $1\times 10^{-5}$ and decreases with polynomial decay. We set the batch size to 8 and train for 400,000 iterations with a warm-up period of 40,000 iterations to reach the initial learning rate. We resize input images to $480\times 480$. We set $\tau$ in Eq. (11) and $\lambda$ in Eq. (12) to 0.8 and 0.1, respectively, for all experiments.
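The arithmetic behind these settings can be checked with a short sketch. It assumes that each decoder layer upsamples the prediction by a factor of 2, which is consistent with using 4 layers for a patch size of 16 but is not stated explicitly; the function names are hypothetical.

```python
from typing import Tuple


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


def num_upsampling_stages(patch_size: int, factor: int = 2) -> int:
    """Number of x`factor` upsampling stages needed to undo the patching.

    Assumption of this sketch: every decoder layer upsamples by `factor`.
    """
    stages, size = 0, 1
    while size < patch_size:
        size *= factor
        stages += 1
    if size != patch_size:
        raise ValueError("patch_size must be a power of factor")
    return stages


side, total = patch_grid(image_size=480, patch_size=16)
stages = num_upsampling_stages(patch_size=16)
print(side, total, stages)          # grid side, number of patches, decoder stages
print(side * 2 ** stages)           # resolution recovered after upsampling
```

For a $480\times 480$ input and a patch size of 16, this gives a $30\times 30$ grid of 900 patches, and four doublings bring the $30\times 30$ prediction back to $480\times 480$.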

Evaluation protocol. Following previous work, we adopt the cumulative Intersection-over-Union (IoU) metric, where the total intersection is divided by the total union over all test samples. We also evaluate the accuracy at IoU thresholds of $\{0.5,0.6,0.7,0.8,0.9\}$.
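The two metrics can be sketched as follows. This is an illustrative implementation under assumptions: masks are flattened binary lists, the accuracy at a threshold is taken to be the fraction of samples whose per-sample IoU strictly exceeds it, and a sample with an empty union is given an IoU of 0. The function name `evaluate` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def evaluate(
    preds: List[Sequence[int]],
    gts: List[Sequence[int]],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Tuple[float, Dict[float, float]]:
    """Return (cumulative IoU, accuracy at each IoU threshold)."""
    total_inter = total_union = 0
    per_sample_iou = []
    for pred, gt in zip(preds, gts):
        inter = sum(1 for p, g in zip(pred, gt) if p and g)
        union = sum(1 for p, g in zip(pred, gt) if p or g)
        total_inter += inter
        total_union += union
        # Empty union: IoU defined as 0 here (assumption of this sketch).
        per_sample_iou.append(inter / union if union else 0.0)
    cumulative_iou = total_inter / total_union if total_union else 0.0
    accuracy = {
        t: sum(iou > t for iou in per_sample_iou) / len(per_sample_iou)
        for t in thresholds
    }
    return cumulative_iou, accuracy


iou, acc = evaluate(
    preds=[[1, 1, 0, 0], [1, 0, 0, 0]],
    gts=[[1, 1, 1, 0], [0, 1, 0, 0]],
)
print(iou, acc)
```

Because intersections and unions are accumulated before dividing, the cumulative IoU weights large objects more heavily than the per-sample accuracy does, which is why the two are reported together.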

4.2 Comparisons with the State of the Art

We compare ReSTR with other referring image segmentation models on four benchmarks. As summarized in Table 1, ReSTR achieves outstanding performance without inefficient postprocessing (e.g., DenseCRF) compared with prior art on all public benchmarks except for the UNC+ testB set. Following prior work, we discuss the relationship between language expression length and performance, as summarized in Table 2. The results demonstrate that ReSTR clearly outperforms previous methods on most groups of expression length, except for the 1-5 length group on the Gref val set. Moreover, the performance of ACM, which uses an attention mechanism for long-range interactions between the two modalities, drops by 13.71%p from the 1-5 to the 11-20 length group on the Gref val set, while that of ReSTR drops by 6.81%p. This demonstrates that our method is better at capturing long-range interactions between the two modalities than previous methods. Note that the recent methods use a visual backbone pretrained on the COCO object detection dataset and evaluate their models on only the three benchmarks based on the COCO dataset. In contrast, our visual backbone is pretrained for ImageNet classification, and ReSTR is evaluated on all benchmarks.

4.3 Analysis of Variants of Fusion Encoder

To verify our design choice for the multimodal fusion encoder, we investigate variants of the fusion encoder. We use 4 transformer layers, denoted as $\{f_{1},f_{2},f_{3},f_{4}\}$ , in all variants of the encoder.

To resolve this problem, we consider disconnecting interactions between the visual features and the class seed embedding as illustrated in Fig. 3(b), denoted as Independent Multimodal Encoder (IME). In other words, the class seed embedding interacts with only the linguistic features. Therefore, IME restricts the class seed embedding from being adaptively transformed to an adaptive classifier with the visual information.

To this end, we propose a structure that indirectly conjugates the class seed embedding and the visual features with the linguistic features as a medium, denoted as indirect Conjugating Multimodal Encoder (CME), as illustrated in Fig. 3(c). As mentioned in Sec. 3.2, the design aims to avoid interaction between the irrelevant visual features and the class seed embedding by routing their interactions indirectly through the linguistic features. Furthermore, CME produces the adaptive classifier for the target entity described in the language expression through fine interactions between the linguistic features and the class seed embedding.

As summarized in Table 3(b), we compare the three variants of the multimodal fusion encoder in terms of performance, computational cost (MACs), and the number of parameters (# params) of these encoders without the segmentation decoder. These results demonstrate the superiority of CME over the other variants of the fusion encoder in performance and efficiency. In addition, we also experiment with CME with weight sharing (CME†) between transformer layers of the visual-linguistic encoder and between those of the linguistic-seed encoder. The result shows that CME† still achieves better performance with fewer parameters and lower computational cost than the other variants.

4.4 In-depth Analysis of ReSTR

We investigate our framework on the val set of the Gref dataset, which contains longer and more complicated language expressions than the others.

Effect of the number of transformer layers in the multimodal fusion encoder. We study the impact of the number of transformer layers in the multimodal fusion encoder by varying it over {2, 4, 6}. Since the multimodal fusion encoder is composed of two transformer encoders, it always has an even number of transformer layers. As summarized in Table 4, the performance improves noticeably up to 4 transformer layers and only marginally with 6 transformer layers.

Effect of the segmentation decoder. We investigate the contribution of the segmentation decoder. As summarized in Table 4, the decoder improves IoU by 1.67%p when used along with 4 transformer layers in the fusion encoder. However, with a fusion encoder of 2 transformer layers, the improvement made by the segmentation decoder is only 0.31%p. When coupled with the shallow fusion encoder, which produces a relatively larger portion of false patch-level predictions, the effect of the segmentation decoder is marginal since it is trained to refine the mask of the positive patches. The results demonstrate that the decoder is specialized to refine a patch-level prediction into a fine pixel-level prediction. Note that the segmentation decoder is not analyzed with the fusion encoder of 6 transformer layers due to memory limitations.

Effect of weight sharing. In Table 4, we also present the performance of the model with weight sharing. Using weight sharing, the number of parameters remains the same regardless of the number of transformer layers that the multimodal fusion encoder contains. The results show that the performance degradation incurred by weight sharing is marginal. It demonstrates that ReSTR could be used in an efficient manner with little loss of performance using weight sharing.

Qualitative analysis. As illustrated in Fig. 4, the patch-level predictions of ReSTR are roughly localized on the target patches and the boundaries of relational objects. Then, the patch-level predictions are transformed into fine pixel-level predictions by the segmentation decoder in a coarse-to-fine manner. Moreover, in Fig. 5, we visualize the predictions obtained when different language expressions are given as queries. These visualizations show that ReSTR is able to predict the segmentation masks corresponding to different language expressions on the same image.

Computation cost analysis. In Table 5, we present the number of parameters and MACs of ReSTR and recent studies whose codes are publicly available. ReSTR achieves the best accuracy with the least computation since it employs the efficient segmentation decoder. Also, the size of the visual feature used in previous work is 4 times bigger than ours.

Conclusion

We have proposed ReSTR, the first convolution-free model for referring image segmentation. ReSTR adopts transformers for both visual and linguistic modalities to capture global context from feature extraction. It also includes the multimodal fusion encoder composed of transformers to encode fine and flexible interactions between these features of the two modalities. Also, the multimodal fusion encoder computes an adaptive classifier for patch-level classification. Furthermore, we have proposed a segmentation decoder to refine the patch-level predictions to the pixel-level prediction in a coarse-to-fine manner. ReSTR outperformed the existing referring image segmentation techniques on all public benchmarks. The fact that computational cost quadratically increases as patch size decreases is the potential limitation of our work. Since the performance of the dense prediction tasks heavily depends on the patch size when using the visual transformer, it introduces an undesirable trade-off between performance and computational cost. To alleviate this, integrating linear-complexity transformer architectures would be a promising research direction, which we leave for future work.

Acknowledgement. We thank Manjin Kim and Sehyun Hwang for fruitful discussions. This work was supported by MSRA Collaborative Research Program, and the NRF grant and the IITP grant funded by Ministry of Science and ICT, Korea (NRF-2021R1A2C3012728, IITP-2020-0-00842, No.2019-0-01906 Artificial Intelligence Graduate School Program–POSTECH).

Appendix A Impact of the length of language expression

We present a detailed analysis of performance according to language expression length in Fig. A1. Following prior work, each test set is split into four groups of roughly equal size in terms of language expression length (i.e., sentence length). Our method outperforms most previous methods except on the 1-5 length group of the Gref dataset, where the gap is a marginal 1.2%p. Furthermore, our method shows less performance degradation from the shortest to the longest sentence length group on the four datasets than ACM, which is the most recent work. Although ACM is proposed to capture long-range dependencies, it still seems to struggle to understand the complex interactions between words of long language expressions. Therefore, the performance improvement of ACM mostly comes from its performance on the short sentence length groups. In contrast, ReSTR improves performance on most groups, which suggests that our model captures long-range interactions within the language expression better than the previous work.

Appendix B Sensitivity to hyperparameters

We investigate the effect of two hyperparameters: the loss-balancing weight $\lambda$ and the threshold $\tau$ used to generate patch-level labels. The results of our analysis are summarized in Fig. A2, in which we examine the IoU of ReSTR by varying the values of the hyperparameters over $\lambda\in\{0.01,0.05,0.1,0.5,1\}$ and $\tau\in\{0.5,0.6,0.7,0.8,0.9\}$. The results suggest that when $\lambda$ is between 0.05 and 0.5, the performance of ReSTR is high and stable, and thus insensitive to the hyperparameter setting. Note that the hyperparameter setting of ReSTR reported in the main paper (the underlined performance in Fig. A2) is not the best, although it outperforms all existing methods, as we do not tune the hyperparameters to optimize the test performance.

Appendix C More qualitative results

In Fig. A3 and Fig. A4, qualitative results of ReSTR on the Gref dataset are presented. Pixel-level predictions and the results post-processed with DenseCRF are provided together. The results show that ReSTR successfully segments masks of the target entities described in various language expressions. For example, ReSTR predicts accurate masks for language expressions about non-human objects (Fig. A3), partially visible objects (rows 1-3 in Fig. A4), and occluded objects (rows 5-7 in Fig. A4). Moreover, the qualitative results of the pixel-level prediction show that the segmentation decoder of ReSTR produces fine-grained predictions and removes false positives in the patch-level prediction. As shown in Fig. A5, we also present more qualitative results of ReSTR for varying language expressions on each image. The results show that ReSTR can comprehend various types of objects (rows 1-2 in Fig. A5), a sense of locality (rows 3-4 in Fig. A5), and fine-grained details (row 5 in Fig. A5).