GRES: Generalized Referring Expression Segmentation

Chang Liu, Henghui Ding, Xudong Jiang

Introduction

Referring Expression Segmentation (RES) is one of the most important tasks in multi-modal information processing. Given an image and a natural language expression that describes an object in the image, RES aims to find this target object and generate a segmentation mask for it. It has great potential in many applications, such as video production, human-machine interaction, and robotics. Most existing methods follow the RES rules defined in the popular datasets ReferIt and RefCOCO and have achieved great progress in recent years.

Limitations of classic RES. However, most classic RES methods impose strong pre-defined constraints on the task. First, classic RES does not consider no-target expressions that do not match any object in the image. This means that the behavior of existing RES methods is undefined if the target does not exist in the input image. In practical applications under this constraint, the input expression has to match an object in the image; otherwise problems inevitably occur. Second, most existing datasets, e.g., the most popular RefCOCO, do not contain multi-target expressions that point to multiple instances. This means that multiple inputs are needed to search for objects one by one. E.g., in Fig. 1, four distinct expressions and four model calls are required to segment “All people”. Our experiments show that classic RES methods trained on existing datasets do not generalize well to these scenarios.

New benchmark and dataset. In this paper, we propose a new benchmark, called Generalized Referring Expression Segmentation (GRES), which allows expressions indicating any number of target objects. GRES takes an image and a referring expression as input, the same as classic RES. Different from classic RES, as shown in Fig. 1, GRES further supports multi-target expressions that specify multiple target objects in a single expression, e.g., “Everyone except the kid in white”, and no-target expressions that do not match any object in the image, e.g., “the kid in blue”. This provides much more flexibility for input expressions, making referring expression segmentation more useful and robust in practice. However, existing referring expression datasets contain neither multi-target nor no-target samples, but only single-target expression samples, as shown in Tab. 1. To facilitate research efforts on realistic referring segmentation, we build a new dataset for GRES, called gRefCOCO. It complements RefCOCO with two kinds of samples: multi-target samples, in which the expression points to two or more target instances in the image, and no-target samples, in which the expression does not match any object in the image.

A baseline method. Moreover, we design a baseline method based on the objectives of the GRES task. It is known that modeling relationships, e.g., region-region interactions, plays a crucial role in RES. However, classic RES methods only have one target to detect, so many methods can achieve good performance without explicit region-region interaction modeling. In GRES, as multi-target expressions involve multiple objects in one expression, it is more challenging and essential to model the long-range region-region dependencies. From this point, we propose a region-based method for GRES that explicitly models the interaction among regions with sub-instance clues. We design a network that splits the image into regions and makes them explicitly interact with each other. Moreover, unlike previous works where regions come from a simple hard-split of the input image, our network soft-collates features for each region, achieving more flexibility. We conduct extensive experiments comparing our proposed method against other RES methods, showing that the explicit modeling of interaction and flexible region features greatly contributes to the performance of GRES.

In summary, our contributions are listed as follows:

We propose a benchmark of Generalized Referring Expression Segmentation (GRES), making RES more flexible and practical in real-world scenarios.

We propose a large-scale GRES dataset gRefCOCO. To the best of our knowledge, this is the first referring expression dataset that supports expressions indicating an arbitrary number of target objects.

We propose a solid baseline method ReLA for GRES to model complex ReLAtionships among objects, which achieves the new state-of-the-art performance on both classic RES and newly proposed GRES tasks.

We conduct extensive experiments and comparisons of the proposed baseline method and other existing RES methods on GRES, and analyze the possible causes of the performance gap and the new challenges in GRES.

Related Works

Related referring tasks and datasets. Defined by Hu et al., Referring Expression Segmentation (RES) derives from a similar task, Referring Expression Comprehension (REC), which outputs a bounding box for the target. The earliest dataset for RES and REC is ReferIt, in which one expression only refers to one instance. Later, Yu et al. propose RefCOCO for RES and REC. However, like ReferIt, it only contains single-target expressions. Another popular dataset, RefCOCOg, also inherits this. Although the original definition of RES does not limit the number of target instances, “one expression, one instance” has become a de-facto rule for the RES task.

Recently, some new datasets have been proposed, but most of them are neither focused on nor suitable for GRES. E.g., although PhraseCut has multi-target expressions, it only considers them as a “fallback”, i.e., multi-target expressions are only used when an object cannot be uniquely referred to. In contrast, our expressions intentionally refer to multiple targets. Besides, expressions in PhraseCut are written using templates rather than free natural language, limiting the diversity of language usage. Image caption datasets are close to RES, but they cannot ensure the unambiguity of expression $\rightarrow$ object(s). Thus, they are not suitable for referring-related tasks. There are some referring datasets using other data modalities or learning schemes, e.g., ScanRefer focuses on 3D objects and ClevrTex focuses on unsupervised learning. Moreover, none of the above datasets has no-target expressions.

Referring segmentation methods. Referring segmentation methods can be roughly divided into two categories: one-stage (or top-down) methods and two-stage (or bottom-up) methods. One-stage methods usually have an FCN-like end-to-end network, and the prediction is achieved by per-pixel classification on the fused multi-modal feature. Two-stage methods first find a set of instance proposals using an off-the-shelf instance segmentation network and then select the target instance from them. The majority of RES methods are one-stage, while two-stage methods are more prevalent in REC. Most recently, some transformer-based methods have been proposed and bring large performance gains compared to CNN-based networks. Zero-shot segmentation methods use class names as textual information and focus on identifying novel categories, in contrast to RES, which employs natural expressions to identify the user’s intended target.

Task Setting and Dataset

Revisit of RES. Classic Referring Expression Segmentation (RES) takes an image and an expression as inputs. The desired output is a segmentation mask of the target region that is referred to by the input expression. As discussed in Sec. 2, the current RES does not consider no-target expressions, and all samples in current datasets only have single-target expressions. Thus, existing models are likely to output an incorrect instance if the input expression refers to nothing or to multiple targets in the input image.

Generalized RES. To address these limitations in classic RES, we propose a benchmark called Generalized Referring Expression Segmentation (GRES) that allows expressions indicating an arbitrary number of target objects. A GRES data sample contains four items: an image $I$ , a language expression $T$ , a ground-truth segmentation mask $M_{\mathit{GT}}$ that covers pixels of all targets referred to by $T$ , and a binary no-target label $E_{\mathit{GT}}$ that indicates whether $T$ is a no-target expression. The number of instances in $T$ is not limited. GRES models take $I$ and $T$ as inputs and predict a mask $M$ . For no-target expressions, $M$ should be all negative.
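As a minimal sketch of the four-item sample described above, the following container holds $I$ , $T$ , $M_{\mathit{GT}}$ and $E_{\mathit{GT}}$ and checks that the mask and the no-target label agree. The class name, field names and the consistency check are illustrative assumptions, not the actual gRefCOCO file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GRESSample:
    """One GRES sample; names are hypothetical, not the gRefCOCO schema."""
    image: List[List[List[int]]]  # I: H x W x 3 pixel values
    expression: str               # T: free-form referring expression
    mask_gt: List[List[int]]      # M_GT: H x W binary mask covering all targets
    no_target: bool               # E_GT: True if T matches nothing in I

    def is_consistent(self) -> bool:
        # A no-target sample must have an all-negative mask, and a sample
        # with targets must have at least one foreground pixel.
        has_foreground = any(any(row) for row in self.mask_gt)
        return has_foreground != self.no_target


# Tiny 2 x 2 example: one multi-target sample and one no-target sample.
image = [[[0, 0, 0] for _ in range(2)] for _ in range(2)]
multi = GRESSample(image, "everyone except the kid in white", [[1, 0], [1, 1]], False)
empty = GRESSample(image, "the kid in blue", [[0, 0], [0, 0]], True)
```

Keeping $E_{\mathit{GT}}$ as an explicit field, rather than deriving it from an empty mask at load time, mirrors the four-item definition in the text.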

Multi-target and no-target expressions are useful not only for finding multiple targets and rejecting inappropriate expressions that match nothing, but also for bringing referring segmentation into more realistic scenarios with advanced usages. For example, with the support of multi-target expressions, we can use expressions like “all people” and “two players on left” as input to select multiple objects in a single forward pass (see Fig. 2(a)), or use expressions like “foreground” and “kids” to achieve user-defined open vocabulary segmentation. With the support of no-target expressions, users can apply the same expression to a set of images to identify which images contain the object(s) in the language expression, as in Fig. 2(b). This is useful if users want to find and matte something in a group of images, similar to image retrieval but more specific and flexible. Moreover, allowing multi-target and no-target expressions enhances the model’s reliability and robustness in realistic scenarios where any type of expression can occur unexpectedly. For example, users may accidentally or intentionally mistype a sentence.

Evaluation. To encourage the diversity of GRES methods, we do not force GRES methods to differentiate the instances in an expression, although our dataset gRefCOCO provides instance-level annotations, so that popular one-stage methods can participate in GRES. Besides the regular RES performance metrics cumulative IoU (cIoU) and Precision@X, we further propose a new metric called generalized IoU (gIoU), which extends the mean IoU to all samples including no-target ones. Moreover, no-target performance is also separately evaluated by computing No-target-accuracy (N-acc.) and Target-accuracy (T-acc.). Details are given in Sec. 5.1.

gRefCOCO: A Large-scale GRES Dataset

To perform the GRES task, we construct the gRefCOCO dataset. It contains 278,232 expressions, including 80,022 multi-target and 32,202 no-target expressions, referring to 60,287 distinct instances in 19,994 images. Masks and bounding boxes for all target instances are given. Some of the single-target expressions are inherited from RefCOCO. We developed an online annotation tool to find images, select instances, write expressions, and verify the results.

The basic annotation procedure follows ReferIt to ensure the annotation quality. The data split is also kept the same as the UNC partition of RefCOCO. We compare the proposed gRefCOCO with RefCOCO and list some unique and significant features of our dataset as follows.

Multi-target samples. In practice, users usually cluster multiple targets of an image by describing their logical relationships or similarities. From this point, we let annotators select target instances rather than randomly assembling them. Then annotators write an unambiguous referring expression for the selected instances. There are four major features and challenges brought by multi-target samples:

1) Usage of counting expressions, e.g., “The two people on the far left” in Fig. 3(a). As the original RefCOCO already has ordinal word numbers like “the second person from left”, the model must be able to differentiate cardinal numbers from ordinal numbers. Explicit or implicit object-counting ability is desired to address such expressions.

2) Compound sentence structures without geometrical relation, like compound sentences “A and B”, “A except B”, and “A with B or C”, as shown in Fig. 3. This raises higher requirements for models to understand the long-range dependencies of both the image and the sentence.

3) Domain of attributes. When there are multiple targets in an expression, different targets may share attributes or have different attributes, e.g., “the right lady in blue and kid in white”. Some attributes may be shared, e.g., “right”, and others may not, e.g., “blue” and “white”. This requires the model to have a deeper understanding of all the attributes and map the relationship of these attributes to their corresponding objects.

4) More complex relationships. Since a multi-target expression involves more than one target, relationship descriptions appear more frequently and are more complicated than in single-target ones. Fig. 3(b) gives an example. Two similar expressions are applied to the same image. Both expressions have the conjunction word “and”, and “two passengers” as an attribute of the target “bike”. But the two expressions refer to two different sets of targets, as shown in Fig. 3(b). Thus in GRES, relationships are not only used to describe the target but also indicate the number of targets. This requires the model to have a deep understanding of all instances and their interactions in the image and expression.

No-target samples. During the annotation, we found that if we do not set any constraints for no-target expressions, annotators tend to write a lot of simple or general expressions that are quite different from other expressions with valid targets. E.g., annotators may write duplicated “dog” for all images without dogs. To avoid these undesirable and purposeless samples in the dataset, we set two rules for no-target expressions:

1) The expression cannot be totally irrelevant to the image. For example, given the image in Fig. 3(a), “The kid in blue” is acceptable as there do exist kids in the image, but none of them is in blue. But expressions like “dog”, “car”, “river” etc. are unacceptable as they are totally irrelevant to anything in this image.

2) The annotators could choose a deceptive expression drawn from other images in the same data split of RefCOCO, if an expression satisfying 1) is hard to come up with.

These rules greatly improve the diversity of no-target expressions and keep our dataset at a reasonable difficulty. More examples are shown in the Supplementary Materials.

The Proposed Method for GRES

As discussed earlier, the relationship and attribute descriptions are more complex in multi-target expressions. Compared with classic RES, it is more challenging and important for GRES to model the complex interaction among regions in the image, and capture fine-grained attributes for all objects. We propose to make different parts of the image and different words in the expression interact explicitly in order to analyze their dependencies.

Outputs and Loss. The predicted mask $M$ is supervised by the ground-truth target mask $M_{\mathit{GT}}$ . The $P\times P$ probability map $x_{r}$ is supervised by a “minimap” downsampled from $M_{\mathit{GT}}$ , so that we can link each region with its corresponding patch in the image. Meanwhile, we take the global average of all region features $F_{r}$ to predict a no-target label $E$ . During inference, if $E$ is predicted to be positive, the output mask $M$ is set to empty. $M$ , $x_{r}$ and $E$ are guided by the cross-entropy loss.
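The two mechanical steps in this paragraph, building a $P\times P$ minimap target from $M_{\mathit{GT}}$ and emptying $M$ when $E$ is positive, can be sketched as follows. The text does not specify the downsampling operator, so average pooling over equal cells (with $H$ and $W$ divisible by $P$ ) is an assumption, and both function names are hypothetical.

```python
from typing import List

Mask = List[List[int]]


def downsample_to_minimap(mask: Mask, p: int) -> List[List[float]]:
    """Average-pool an H x W binary mask into a P x P grid of foreground ratios.

    Assumes H and W are divisible by P; average pooling is an illustrative
    choice, since the downsampling operator is not stated in the text.
    """
    h, w = len(mask), len(mask[0])
    cell_h, cell_w = h // p, w // p
    minimap = []
    for i in range(p):
        row = []
        for j in range(p):
            cell = [
                mask[y][x]
                for y in range(i * cell_h, (i + 1) * cell_h)
                for x in range(j * cell_w, (j + 1) * cell_w)
            ]
            row.append(sum(cell) / len(cell))
        minimap.append(row)
    return minimap


def apply_no_target_gate(mask: Mask, no_target_pred: bool) -> Mask:
    """Return an all-negative mask when the no-target label E is positive."""
    if no_target_pred:
        return [[0] * len(row) for row in mask]
    return mask
```

Each minimap cell then acts as the supervision target for the region tied to that image patch, which is the link between regions and patches that the text describes.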

ReLAtionship Modeling

The proposed ReLAtionship modeling has two main modules, Region-Image Cross Attention (RIA) and Region-Language Cross Attention (RLA). The RIA flexibly collects region image features. The RLA captures the region-region and region-language dependency relationships.

Region-Language Cross Attention (RLA). Region image features $F^{\prime}_{r}$ are collated from image features, so they do not yet contain relationships between regions or language information. We propose the RLA module to model the region-region and region-language interactions. As shown in the RLA sub-figure, RLA consists of a self-attention for region image features $F^{\prime}_{r}$ and a multi-modal cross attention. The self-attention models the region-region dependency relationships. It computes the attention matrix by interacting one region feature with all other regions and outputs the relationship-aware region feature $F_{r1}$ . Meanwhile, the cross attention takes language feature $F_{t}$ as Value and Key input, and region image feature $F^{\prime}_{r}$ as Query input. This first models the relationship between each word and each region.
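To make the Query/Key/Value roles concrete, here is a generic scaled dot-product cross-attention sketch in which region features act as queries and word features act as keys and values. It is a standard attention formulation under simplifying assumptions (no learned projections, no multi-head split, no normalization layers) and is not the exact RLA formulation; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def region_language_attention(regions: Matrix, words: Matrix) -> Tuple[Matrix, Matrix]:
    """regions: N x C queries (F'_r); words: L x C keys and values (F_t).

    Returns an N x L matrix of region-word attention weights and the
    N x C language-aware region features. Learned projections are omitted.
    """
    scale = math.sqrt(len(regions[0]))
    attention = [
        softmax([sum(q * k for q, k in zip(region, word)) / scale for word in words])
        for region in regions
    ]
    channels = len(words[0])
    attended = [
        [sum(a * word[c] for a, word in zip(weights, words)) for c in range(channels)]
        for weights in attention
    ]
    return attention, attended
```

Calling the same function with `words=regions` gives the projection-free analogue of the region self-attention, where each region attends to all regions.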

Experiments and Discussion

Besides the widely-used RES metrics cumulative IoU (cIoU) and Precision@X (Pr@X), we further introduce No-target accuracy (N-acc.), Target accuracy (T-acc.), and generalized IoU (gIoU) for GRES.

cIoU and Pr@X. cIoU calculates the total intersection pixels over the total union pixels, and Pr@X counts the percentage of samples with IoU higher than the threshold $X$ . Notably, no-target samples are excluded in Pr@X. As multi-target samples have larger foreground areas, it is easier for models to get higher cIoU scores. Thus, we raise the starting threshold to 0.7 for Pr@X.
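A minimal sketch of both metrics on binary masks stored as nested lists is given below. The function names are hypothetical, and detecting no-target samples through an empty ground-truth mask is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def cumulative_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """cIoU: total intersection pixels over total union pixels."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def precision_at(preds: Sequence[Mask], gts: Sequence[Mask], threshold: float) -> float:
    """Pr@X: share of samples with targets whose IoU exceeds the threshold.

    No-target samples (empty ground-truth mask) are skipped.
    """
    hits = counted = 0
    for pred, gt in zip(preds, gts):
        if not any(any(row) for row in gt):
            continue
        inter, union = intersection_union(pred, gt)
        counted += 1
        if inter / union > threshold:
            hits += 1
    return hits / counted if counted else 0.0
```

Because cIoU pools pixels across the whole set, a single sample with a large foreground can outweigh many small ones, which is the bias toward larger targets that the text notes.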

N-acc. and T-acc. evaluate the model’s performance on no-target identification. For a no-target sample, a prediction without any foreground pixels is a true positive ( $\mathit{TP}$ ), otherwise a false negative ( $\mathit{FN}$ ). Then, N-acc. measures the model’s performance on identifying no-target samples: N-acc. = $\frac{\mathit{TP}}{\mathit{TP}+\mathit{FN}}$ . T-acc. reflects how much the generalization on no-target affects the performance on target samples, i.e., how many samples that have targets are misclassified as no-target. Correspondingly, a sample that has targets counts as a true negative ( $\mathit{TN}$ ) if the prediction contains foreground pixels and as a false positive ( $\mathit{FP}$ ) if it is predicted as no-target: T-acc. = $\frac{\mathit{TN}}{\mathit{TN}+\mathit{FP}}$ .
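The two accuracies can be sketched as below, treating “no-target” as the positive class. The reading of $\mathit{TN}$ (a sample with targets whose prediction has foreground) and $\mathit{FP}$ (a sample with targets predicted as empty) is inferred from the T-acc. formula, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]


def is_empty(mask: Mask) -> bool:
    """True if the mask has no foreground pixel."""
    return not any(any(row) for row in mask)


def no_target_accuracies(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """Return (N-acc., T-acc.) with 'no-target' as the positive class."""
    tp = fn = tn = fp = 0
    for pred, gt in zip(preds, gts):
        if is_empty(gt):
            # No-target sample: an empty prediction is correct.
            if is_empty(pred):
                tp += 1
            else:
                fn += 1
        else:
            # Sample with targets: an empty prediction is a false alarm.
            if is_empty(pred):
                fp += 1
            else:
                tn += 1
    n_acc = tp / (tp + fn) if tp + fn else 0.0
    t_acc = tn / (tn + fp) if tn + fp else 0.0
    return n_acc, t_acc
```

Note that T-acc. only checks whether any foreground is predicted for a sample with targets; it says nothing about the quality of that mask, which is left to the IoU-based metrics.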

gIoU. It is known that cIoU favors larger objects. As multi-target samples have larger foreground areas in GRES, we introduce generalized IoU (gIoU) that treats all samples equally. Like mean IoU, gIoU calculates the mean value of per-image IoU over all samples. For no-target samples, the IoU values of true positive no-target samples are regarded as 1, while IoU values of false negative samples are treated as 0.
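The gIoU rule can be sketched as a per-sample mean with the two no-target special cases. The function name is hypothetical, and an empty ground-truth mask is assumed to mark a no-target sample.

```python
from typing import List, Sequence

Mask = List[List[int]]


def generalized_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """gIoU: mean of per-sample IoU, including no-target samples."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter = union = 0
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                if p and g:
                    inter += 1
                if p or g:
                    union += 1
        if not any(any(row) for row in gt):
            # No-target sample: 1 for an empty prediction, 0 otherwise.
            scores.append(1.0 if union == 0 else 0.0)
        else:
            scores.append(inter / union)
    return sum(scores) / len(scores) if scores else 0.0
```

Every sample contributes one score in [0, 1] regardless of its foreground size, so large multi-target masks no longer dominate the average.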

Ablation Study

Dataset necessity. To show the necessity and validity of gRefCOCO on the task of GRES, we compare the results of the same model trained on RefCOCO and gRefCOCO. As shown in Fig. 6, image (a) is a multi-target sample using a shared attribute (“in black jacket”) to find “two guys”. The model trained on RefCOCO only finds one, even though the expression explicitly points out that there are two target objects. Image (b) gives a no-target expression, and the RefCOCO-trained model outputs a meaningless mask. The results demonstrate that models trained only on single-target referring expression datasets, e.g., RefCOCO, cannot be well generalized to GRES. In contrast, the newly built gRefCOCO can effectively enable the model to handle expressions indicating an arbitrary number of objects.

Design options of RIA. In Tab. 2, we investigate the performance gain brought by RIA. In model #1, we follow previous methods and rigidly split the image into $P\!\times\!P$ patches before sending them into the encoder. Tab. 2 shows that this method is not suitable for our ReLA framework, because it makes the global image information less pronounced due to compromised integrity. In model #2, RIA is replaced by average pooling the image feature into $P\!\times\!P$ . The gIoU gets a significant gain of $5.59\%$ from model #1, showing the importance of global context in visual feature encoding. Then, another $2.67\%$ gIoU gain is obtained by adding our proposed dynamic region feature aggregation for each query (Eq. (2)), showing the effectiveness of the proposed adaptive region assignment. Moreover, we study the importance of linking queries with actual image regions. In model #3, we remove the minimap supervision so that the region-based queries $Q_{r}$ become plain learnable queries, resulting in a $1.54\%$ gIoU drop. This shows that explicit correspondence between queries and spatial image regions is beneficial to our network.

Design options of RLA. Tab. 3 shows the importance of dependency modeling to GRES. In the baseline model, RLA is replaced by point-wise multiplying region features and globally averaged language features, to achieve a basic feature fusion like previous works. In model #2, the language cross attention is added onto the baseline model, which brings a gIoU gain of $1.22\%$ . This shows the validity of region-word interaction modeling. Then we further add the region self-attention to investigate the importance of the region-region relationship. The region-region relationship modeling brings a performance gain of $3.85\%$ gIoU. The region-region and region-word relationship modeling together bring a significant improvement of $5.07\%$ gIoU.

Number of regions $P$ . Smaller $P$ leads to coarser regions, which is not good for capturing fine-grained attributes, while larger $P$ costs more resources and decreases the area of each region, making relationship learning difficult. We experiment with the selection of $P$ in Tab. 4 to find the optimal $P$ . The model’s performance improves as $P$ increases until $10$ , which is selected as our setting. In Fig. 7, we visualize the predicted minimap $x_{r}$ and region maps $M_{r}$ . $x_{r}$ displays a rough target probability of each region, showing the effectiveness of minimap supervision. We also see that the region masks capture the spatial correlation of the corresponding regions. With flexible region size and shape, each region mask contains not only the instance of this region but also other instances with strong relationships. For example, region #4 is located inside the bottom lunch box, but as the input expression states that all three boxes are targets, the top two also cause some responses in the output mask of region #4.

Results on GRES

Comparison with state-of-the-art RES methods. In Tab. 5, we report the results of classic RES methods on gRefCOCO. We re-implement these methods using the same backbone as our model and train them on gRefCOCO. For one-stage networks, output masks with fewer than 50 positive pixels are cleared to all-negative, for better no-target identification. For the two-stage network MattNet, we let the model predict a binary label for each instance that indicates whether this candidate is a target, then merge all target instances. As shown in Tab. 5, these classic RES methods do not perform well on gRefCOCO, which contains multi-target and no-target samples. Furthermore, to better verify the effectiveness of explicit modeling, we add our ReLA to VLT and LAVT, replacing their decoder parts. As shown in Tab. 5, our explicit relationship modeling greatly enhances the model’s performance. E.g., adding ReLA improves the cIoU performance of LAVT by more than $4\%$ on the val set.
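The two adaptation rules for classic methods can be sketched as simple post-processing steps: clearing masks with fewer than 50 positive pixels for one-stage networks, and taking the union of instances labeled as targets for a two-stage network. The function names and the nested-list mask format are illustrative assumptions.

```python
from typing import List, Sequence

Mask = List[List[int]]


def clear_small_mask(mask: Mask, min_pixels: int = 50) -> Mask:
    """Treat a prediction as no-target if it has fewer than min_pixels positives."""
    positives = sum(1 for row in mask for value in row if value)
    if positives < min_pixels:
        return [[0] * len(row) for row in mask]
    return mask


def merge_target_instances(instance_masks: Sequence[Mask], is_target: Sequence[bool]) -> Mask:
    """Union of all instance masks whose binary target label is True."""
    if not instance_masks:
        return []
    height, width = len(instance_masks[0]), len(instance_masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask, keep in zip(instance_masks, is_target):
        if not keep:
            continue
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    merged[y][x] = 1
    return merged
```

Under this sketch, a two-stage model predicts no-target when none of its candidates is labeled as a target, since the merged mask is then all negative.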

In Tab. 6, we test the no-target identification performance. As shown in the table, the T-acc. values of all methods are mostly higher than $95\%$ , showing that our gRefCOCO does not significantly affect the model’s targeting performance while being generalized to no-target samples. But from the N-acc. of classic RES methods, we see that even when trained with no-target samples, identifying no-target samples solely based on the output mask is not satisfactory. We also test our model with the no-target classifier disabled, using only the positive pixel count in the output mask to identify no-target samples (“ReLA-50pix” in Tab. 6). The performance is similar to that of other methods. This shows that a dedicated no-target classifier is desirable. However, although our N-acc. is higher than that of RES methods, around $40\%$ of no-target samples are still missed. We speculate that this is because many no-target expressions are very deceptive and similar to real instances in the image. We believe that no-target identification will be one of the key focuses of future research on the GRES task.

Qualitative results. Some qualitative examples of our model on the val set of gRefCOCO are shown in Fig. 8. In Image (a), our model can detect and precisely segment multiple targets of the same category (“girls”) or different categories (“girls and the dog”), showing its strong generalization ability. Image (b) uses counting words (“two bowls”) and shared attributes (“on right”) to describe a set of targets. Image (c) has a compound sentence, showing that our model can understand the excluding relationship “except the blurry guy” and make a good prediction.

Failure cases & discussion. We show some failure cases of our method in Fig. 9. Image (a) introduces a possession relationship: “left girl and her laptop”. This is a very deceptive case. In the image, the laptop in the center is more dominant and closer to the left girl than the left one, so the model highlights the center laptop as “her laptop”. Such a challenging case requires the model to have a profound understanding of all objects, and a contextual comprehension of the image and expression. In the second case, the expression is a no-target expression, referring to “man in gray shirt sitting on bed”. In the image, there is indeed a sitting person in a grey shirt, but he is sitting on a black chair very close to the bed. This further requires the model to look into the fine-grained details of all objects, and understand those details with image context.

Results on Classic RES

We also evaluate our method on the classic RES task and report the results in Tab. 7. In this experiment, our model strictly follows the setting of previous methods and is only trained on the RES datasets. As shown in Tab. 7, the proposed approach ReLA outperforms other methods on classic RES. Our performance is consistently higher than the state-of-the-art LAVT by a margin of 1% to 4% on three datasets. Although the performance gain of our proposed method over other methods on classic RES is lower than that on GRES, the results show that the explicit relationship modeling is also beneficial to classic RES. More results are reported in the Supplementary Materials.

Conclusion

We analyze and address the limitations of the classic RES task, i.e., it cannot handle multi-target and no-target expressions. Based on that, a new benchmark, called Generalized Referring Expression Segmentation (GRES), is defined to allow an arbitrary number of targets in the expressions. To support the research on GRES, we construct a large-scale dataset gRefCOCO. We propose a baseline method ReLA for GRES to explicitly model the relationship between different image regions and words, which consistently achieves new state-of-the-art results on both the classic RES and the newly proposed GRES tasks. The proposed GRES greatly reduces the constraints on natural language inputs, extends the application scope to cases with multiple instances or with no matching object in the image, and opens possible new applications such as image retrieval.