OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan

Introduction

With the development of transformer models, recent works in both natural language processing (NLP) and computer vision show one common trend: adopting one unified model to solve multiple tasks. For example, large language models (LLMs) scale up a single model to solve multiple NLP tasks and achieve better results than previous expert models. In vision, we have seen a similar trend of adopting one model to solve multiple tasks or sub-tasks, including detection, segmentation, video analysis, low-level vision, pose estimation, and more. Different methods adopt different transformer designs, including visual in-context learning, unified decoders, and unified tokenizers. In summary, benefiting from the scalability and flexibility of the transformer, adopting one model for all tasks has made great progress.

Meanwhile, by combining vision models and language models, research on multi-modal models also adopts transformer-based designs. One representative work, LLaVA, treats visual tokens as the inputs of LLMs and enables LLMs to understand visual content. Several works adopt similar designs, and all of them are termed Multi-modal Large Language Models (MLLMs). After that, most research focuses on improving MLLM benchmarks in various ways, including increasing data sizes and enhancing the visual encoders and visual resolutions. However, LLaVA-like models cannot output precise location information since they only carry out image-level analysis. Thus, recent works try to fill this gap by adding extra detection models for object-level analysis, mask decoders for pixel-level analysis, and visual prompts, and also propose task-specific instruction tuning with various datasets. By providing extra detection data and a decoder, the updated MLLMs can produce localization outputs. However, these models are tuned on specific tasks, losing the ability of LLaVA for image-level analysis, such as captioning and visual question answering. Meanwhile, several works adopt LLMs as agents to collaborate with various visual models or generation models. Although these works are simple and effective, their inference and parameter costs are huge due to the multiple visual encoders and decoders. Moreover, there are no specific designs for task unification.

Motivated by the previous analysis, we ask one essential question: Can we bridge image-level, object-level, and pixel-level tasks in one MLLM with only one LLM, one visual encoder, and one visual decoder? Returning to universal perception models, we can leverage them to build a stronger MLLM that unifies three levels of input: image, object, and pixel. In particular, we adopt OMG-Seg as our universal perception model due to its simplicity and effectiveness in various segmentation tasks.

In this work, we present OMG-LLaVA, an elegant MLLM that bridges image-level, object-level, and pixel-level reasoning and understanding tasks in one model. We preserve the basic pixel-level segmentation ability of OMG-Seg by freezing the visual encoder and decoder, as shown in the bottom left of Fig. 1. Since the LLM processes text input, OMG-LLaVA can also perform referring segmentation, reasoning segmentation, and grounded conversation generation, as shown in the top left of Fig. 1. Moreover, as shown in Fig. 1, with the help of the LLM, OMG-LLaVA can also perform image-level understanding like LLaVA, including captioning and conversation, an ability that most grounding MLLMs lose. In addition, OMG-LLaVA supports visual prompts as inputs, which enables object-level understanding such as visual prompt-based conversation and region-level captioning. We achieve all these abilities using one LLM, one encoder, and one decoder.

In particular, to better encode the visual segmentation outputs, we propose a perception prior embedding module that absorbs the object queries into object-centric visual tokens, which are the inputs of the LLM. We present a unified instruction formulation strategy, which lets the model accept images, texts, and visual prompts as inputs and generate responses consisting of text, segmentation tokens, segmentation masks, and labels. Following LLaVA, we adopt a pretraining and instruction tuning pipeline. Extensive experiments show the effectiveness of our components and training strategy. In addition to visual segmentation, OMG-LLaVA achieves good performance on 6 datasets, including COCO panoptic segmentation, VIPSeg video panoptic segmentation, refCOCO, refCOCO+, and refCOCOg referring expression segmentation, GranDf grounded conversation generation, and refCOCOg region captioning. We hope our research can inspire the community toward more elegant MLLM designs.

Related Work

Multimodal Large Language Models. Early multimodal models explore better fusion strategies, various feature extractors, and different meta-architectures. Most works focus on single tasks, such as captioning and VQA. With the development of large language models, recent works mainly explore building an instruction-tuning pipeline for multiple multimodal benchmarks. LLaVA is one early work that treats visual features as tokens. After that, several works explore visual cues to enhance the visual inputs of LLaVA. On the other hand, several works add extra components to adapt LLaVA for visual grounding, detection, segmentation, and video analysis. In particular, several works explore language-driven grounding and segmentation. However, these works are all trained for a specific purpose. We aim to build the simplest model that unifies segmentation, instruction tuning, and prompt-driven segmentation. To the best of our knowledge, ours is the first model to achieve this goal.

Unified Segmentation Models. Vision transformers have led to research interest in universal segmentation. Recent works have developed mask classification architectures with an end-to-end set prediction approach, outperforming previous specialized models in both image and video segmentation tasks. In particular, several works adopt one model with shared parameters to perform various segmentation tasks. One recent work, OMG-Seg, is the first to unify image, video, open-vocabulary, and interactive segmentation in one simple model. However, all of these works focus on visual segmentation and lack the MLLM-style ability to interact through generated text and visual prompts. Our work builds such a bridge to align MLLMs, visual segmentation, and prompt-driven segmentation models through joint co-training and model sharing, which serves as a new baseline for this field.

Language-driven Location and Segmentation. Early works in this direction mainly define the various language-driven tasks, including referring segmentation and referring localization. Most works design effective fusion modules to achieve better performance. Meanwhile, several works explore more complex language-driven tasks from various aspects, including robustness, reasoning, and region-level caption. LISA involves reasoning-based segmentation. Then, GLaMM annotates a new dataset and proposes region-level caption and segmentation tasks. Meanwhile, several works use LLMs as agents to assign different visual experts. In contrast to these works, our method is a more elegant baseline, which contains only one visual encoder, one LLM, and one decoder.

Visual Prompts. With the prompting ability of LLMs, several works also explore visual prompting methods in vision. According to their designs and purposes, these works fall into different categories, including learnable tokens, mask-visual-modeling for different tasks, and various visual prompting encoders for visual outputs. Our OMG-LLaVA also supports visual prompts for better interaction with the user’s inputs, showing potential for product use.

Methodology

Motivation and Our Goals. LLMs unify most NLP tasks as token generation tasks and exhibit strong reasoning and instruction-following capabilities. As shown in Fig. 2 (a), LLaVA-like models further introduce visual tokens into LLMs, enabling LLMs to understand visual information and perform visual-based reasoning. However, they cannot accomplish fine-grained visual tasks like object-level and pixel-level understanding and reasoning. As shown in Fig. 2 (b), some works introduce region-level visual embeddings, allowing LLMs to achieve object-level understanding and reasoning tasks. However, these models rely on complex region embedding extraction designs. In addition, most cannot perform pixel-level understanding tasks. Thus, as shown in Fig. 2 (c), other works introduce segmentation tokens, enabling LLMs to output segmentation masks and thus handle pixel-level understanding and reasoning tasks. Nonetheless, they require a large segmentation module, such as SAM, making the system highly redundant. As shown in Fig. 2 (d), GLaMM combines the above pipelines to handle object-level and pixel-level tasks. However, this significantly increases the system’s complexity and redundancy. Additionally, GLaMM relies on explicit instructions from the user, losing the perception ability to handle basic pixel-level understanding tasks such as instance segmentation, semantic segmentation, panoptic segmentation, and interactive segmentation.

In this paper, we address all the challenges above in a simpler and more elegant way. Our OMG-LLaVA unifies image-level (such as image captioning and image-based conversation), object-level (such as region captioning and visual prompt-based conversation), and pixel-level (such as universal segmentation, referring segmentation, reasoning segmentation, and grounded conversation generation) visual understanding and reasoning tasks into token-to-token generation. The framework follows a simple and elegant system design, including only one visual perception module and one large language model.

Unified View of Different Tasks. We model various tasks as token-to-token generation to bridge the gap between image-level, object-level, and pixel-level understanding and reasoning. To support these tasks, we define three types of tokens: text tokens $T_{t}$, pixel-centric visual tokens $T_{pv}$, and object-centric visual tokens $T_{ov}$. Text tokens encode textual information. Pixel-centric visual tokens represent dense image features, providing the LLM with comprehensive image information. Object-centric visual tokens encode the features of specified objects, offering the LLM object-centric information, and can be easily decoded into segmentation masks.

For example, in the classic image-level understanding task, i.e., image captioning, a text response $T_{t}^{out}$ is generated based on the text instruction $T_{t}^{in}$ and image features $T_{pv}^{in}$. In the object-level understanding task, region captioning, the text response $T_{t}^{out}$ is generated based on the text instruction $T_{t}^{in}$, image features $T_{pv}^{in}$, and specified object-centric visual tokens $T_{ov}^{in}$. The pixel-level reasoning task, referring segmentation, involves generating object-centric visual tokens $T_{ov}^{out}$ based on the text instruction $T_{t}^{in}$ and image features $T_{pv}^{in}$. Additionally, OMG-LLaVA can support various mixed-level tasks, such as providing grounded descriptions around specified objects.
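The three examples above share one form, which can be written compactly. The following is a sketch of that notation, assembled from the token definitions in this section rather than quoted from a numbered equation; an input or output that a task does not use is simply left empty.

```latex
\[
\begin{aligned}
\text{general form:}\quad & (T_{t}^{out},\, T_{ov}^{out}) = \mathrm{LLM}(T_{pv}^{in},\, T_{ov}^{in},\, T_{t}^{in}) \\
\text{image captioning:}\quad & T_{t}^{out} = \mathrm{LLM}(T_{pv}^{in},\, T_{t}^{in}) \\
\text{region captioning:}\quad & T_{t}^{out} = \mathrm{LLM}(T_{pv}^{in},\, T_{ov}^{in},\, T_{t}^{in}) \\
\text{referring segmentation:}\quad & T_{ov}^{out} = \mathrm{LLM}(T_{pv}^{in},\, T_{t}^{in})
\end{aligned}
\]
```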

Pixel-centric visual tokens can be obtained by tokenizing images using a CLIP backbone as the tokenizer. However, object-centric visual tokens require encoding object information to be easily decoded into segmentation masks. Therefore, methods like mask pooling in Osprey and ROI pooling in GLaMM fail to meet these requirements. We found that a universal perception decoder can meet all the requirements. Thus, we chose the OMG-Seg decoder as the object-centric tokenizer due to its comprehensive capabilities.

OMG-LLaVA Framework

The framework of OMG-LLaVA is shown in Fig. 2 (e). OMG-LLaVA comprises a large language model (LLM) and a frozen universal perception module. The universal perception module encodes images and visual prompts from users into pixel-centric and object-centric visual tokens. It also decodes the object-centric visual tokens output by the LLM into explicit segmentation mask responses. The LLM accepts text instruction tokens and pixel-centric and object-centric visual tokens from the universal perception module as inputs and then outputs text responses along with object-centric visual tokens. The detailed architecture of OMG-LLaVA is illustrated in Fig. 3. The universal perception module comprises an image encoder, an OMG decoder, and a non-trainable perception prior embedding component.

Image Encoder. To maximize the perception capabilities of the universal perception module, we use the ConvNeXt-L-based CLIP model as the image encoder and employ a high image resolution (1024 $\times$ 1024). However, the large image resolution results in excessive visual tokens input into the LLM, leading to significantly higher computational costs than using lower-resolution images (such as 224 $\times$ 224 or 336 $\times$ 336). We address this issue by utilizing the lowest-resolution image features (32 $\times$ downsampling). Additionally, we use the pixel shuffle operator to further reduce the image features’ resolution. Ultimately, the downsampling factor for the image features used to generate visual tokens is 64, meaning that a 1024 $\times$ 1024 image produces 256 visual tokens.
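The token count above follows from simple arithmetic, and the resolution-reducing step can be illustrated with a space-to-depth rearrangement. The sketch below is illustrative only: it assumes that the pixel shuffle step folds each 2 $\times$ 2 neighborhood of the 32 $\times$-downsampled feature map into the channel dimension, and the function names `visual_token_count` and `space_to_depth` are hypothetical, not taken from the OMG-LLaVA code.

```python
def visual_token_count(image_size: int, downsample: int) -> int:
    """Number of visual tokens for a square image at a given total downsampling factor."""
    side = image_size // downsample
    return side * side


def space_to_depth(feat, r=2):
    """Fold each r x r spatial block into channels.

    feat is a nested list of shape H x W x C; the result has shape
    (H / r) x (W / r) x (C * r * r). This is an assumed reading of the
    resolution-reducing "pixel shuffle" step described in the text.
    """
    h, w = len(feat), len(feat[0])
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(feat[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# A 1024 x 1024 image at 32x downsampling gives a 32 x 32 map (1024 tokens);
# one 2 x 2 fold brings the total factor to 64 and the map to 16 x 16.
tokens_before = visual_token_count(1024, 32)
tokens_after = visual_token_count(1024, 64)

feature_map = [[[float(y * 32 + x)] for x in range(32)] for y in range(32)]
folded = space_to_depth(feature_map, r=2)
```

Under these assumptions the token count drops by a factor of four while the information in each 2 $\times$ 2 block is kept in the widened channel dimension.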

OMG Decoder. We utilize the OMG decoder to generate object-centric visual tokens, furnishing the LLM with information regarding the primary objects in the image and those indicated by the user’s input visual prompts. As shown on the left side of Fig. 4, the OMG decoder comprises masked cross-attention and self-attention layers. The OMG decoder’s input includes a set of learnable object queries for automatically capturing all objects of interest and visual prompt queries derived from encoded input visual prompts. The visual prompt queries and learnable object queries are collectively termed object queries. The OMG decoder extracts features for the object queries from the image features via masked cross-attention and models relationships between objects through self-attention. The object queries can be decoded into segmentation masks and object categories via a simple FFN layer. With the OMG decoder, OMG-LLaVA can efficiently tokenize object information into object-centric visual tokens, thereby equipping the LLM with information about objects in the image and those referenced by the user.

The OMG decoder can accept point prompts as input. While box and mask prompts can be easily converted into point prompts, this crude conversion loses much of the prompt information, complicating the explicit encoding of the user’s intent. To address this, we impose constraints on the attention masks of the masked cross-attention layers based on the visual prompt to precisely encode the object information referenced by the prompt. As depicted on the right side of Fig. 4, for box prompts, we use the box coordinates to define attention masks that mask out all pixel features outside the box. Similarly, for mask prompts, we directly use the provided object mask to generate the attention masks. With this straightforward attention mask modification strategy, OMG-LLaVA can accurately capture the user’s visual prompts, encompassing point, box, and mask prompts.
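The attention mask strategy can be sketched in a few lines. The code below is a minimal illustration, not the OMG decoder implementation: it assumes a convention in which `True` marks a blocked pixel, box coordinates given in feature-map units as `(x0, y0, x1, y1)` with exclusive upper bounds, and a single query attending over a flattened feature map. All function names are hypothetical.

```python
import math


def box_attention_mask(height, width, box):
    """Block (True) every pixel outside the box; box = (x0, y0, x1, y1), upper bounds exclusive."""
    x0, y0, x1, y1 = box
    return [[not (x0 <= x < x1 and y0 <= y < y1) for x in range(width)]
            for y in range(height)]


def mask_prompt_attention_mask(object_mask):
    """Block (True) every pixel that lies outside the provided binary object mask."""
    return [[not bool(v) for v in row] for row in object_mask]


def masked_attention_weights(scores, blocked):
    """Softmax over flattened pixel scores, with blocked pixels forced to zero weight."""
    flat_scores = [s for row in scores for s in row]
    flat_blocked = [b for row in blocked for b in row]
    allowed = [s for s, b in zip(flat_scores, flat_blocked) if not b]
    if not allowed:
        raise ValueError("the prompt blocks every pixel")
    peak = max(allowed)  # subtract the maximum for numerical stability
    exps = [0.0 if b else math.exp(s - peak)
            for s, b in zip(flat_scores, flat_blocked)]
    total = sum(exps)
    return [e / total for e in exps]


# A box covering the central 2 x 2 region of a 4 x 4 feature map.
blocked = box_attention_mask(4, 4, (1, 1, 3, 3))
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(uniform_scores, blocked)
```

With uniform scores, the four pixels inside the box each receive a weight of 0.25 and every pixel outside receives zero, so the prompt query can only gather features from the region the user indicated.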

Perception Prior Embedding. We find that directly combining a frozen perception module with an LLM does not perform well, as also observed in LISA. To retain the full capabilities of the universal perception module, OMG-LLaVA does not fine-tune the perception module to adapt to the output of the large language model. Instead, we propose a perception prior embedding strategy to tackle this challenge. Fig. 5 illustrates the perception prior embedding strategy.

We then compute a weighted average of the object queries $\mathcal{Q}$ using the mask score $MS$, which assigns each pixel a weight for every object query, and obtain the corresponding weighted object queries for each pixel. Pixel-centric visual tokens $T_{pv}$ are obtained by adding the weighted object queries to the image features $\mathcal{F}$:

$$T_{pv} = MS \cdot \mathcal{Q} + \mathcal{F}$$

Additionally, we treat the foreground object queries as object-centric visual tokens $T_{ov}$. The object-centric visual tokens $T_{ov}$ are concatenated with the pixel-centric visual tokens $T_{pv}$ to form the visual tokens $T_{v}=(T_{pv},T_{ov})$, which are input to the LLM to provide rich perception prior information.
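The fusion step described above can be sketched with plain lists. This is an illustrative sketch under stated assumptions: the mask score is taken to be an $HW \times N_q$ matrix of per-pixel weights over the object queries, the weighted average is taken to be its product with the $N_q \times C$ query matrix, and the foreground flags are supplied by the caller. The function name `perception_prior_embedding` and the toy values are hypothetical.

```python
def perception_prior_embedding(mask_score, queries, features, foreground):
    """Fuse object queries into pixel features and select foreground queries.

    mask_score: HW x Nq weights (one row per pixel, assumed already normalized)
    queries:    Nq x C object queries
    features:   HW x C image features
    foreground: Nq booleans marking foreground object queries
    Returns (t_pv, t_ov, t_v), where t_v concatenates t_pv and t_ov.
    """
    t_pv = []
    for weights, feat in zip(mask_score, features):
        # Weighted average of the object queries for this pixel.
        weighted = [sum(w * q[c] for w, q in zip(weights, queries))
                    for c in range(len(feat))]
        # Add the weighted queries to the pixel's own image feature.
        t_pv.append([f + wq for f, wq in zip(feat, weighted)])
    t_ov = [list(q) for q, is_fg in zip(queries, foreground) if is_fg]
    return t_pv, t_ov, t_pv + t_ov


# Toy example: 2 pixels, 2 object queries, 2 channels.
mask_score = [[1.0, 0.0], [0.5, 0.5]]
queries = [[1.0, 2.0], [3.0, 4.0]]
features = [[10.0, 10.0], [20.0, 20.0]]
t_pv, t_ov, t_v = perception_prior_embedding(
    mask_score, queries, features, foreground=[True, False])
```

In this toy case the first pixel belongs entirely to the first query, so its token is the image feature shifted by that query, while the second pixel receives the mean of both queries.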

Visual Projector and Text Projector. Following prior work, we use an MLP as the visual projector, which is responsible for mapping visual tokens to the LLM’s text embedding space. Since our visual tokens consist of pixel-centric and object-centric tokens, the visual projector comprises two MLPs, each handling one type of visual token separately. Inspired by prior work, we also use a simple MLP to map the LLM output’s hidden states of the [SEG] token to the visual space.

Instruction Formulation. OMG-LLaVA accepts visual input, text input, and visual prompt input, and outputs text responses, segmentation tokens, segmentation masks, and labels. Thus, it can handle tasks such as image captioning, image-based conversation, region captioning, visual prompt-based conversation, referring segmentation, reasoning segmentation, grounded conversation, etc. We use a unified instruction formulation to support these functionalities. As shown in Fig. 3, there are three special tokens: an image placeholder token, a region placeholder token, and [SEG]. Before being fed into the LLM, the image placeholder token is replaced by the visual tokens $T_{v}$, and the region placeholder token can be replaced by any object-centric visual token encoded from a visual prompt. The [SEG] token in the LLM’s output is sent to the frozen OMG decoder to be decoded into a segmentation mask.

Training and Testing Setup

Training. Following LLaVA, OMG-LLaVA performs two-stage training: pretraining and instruction tuning. During the pretraining stage, the perception model and LLM are frozen, and only the visual and text projectors are tuned. In addition to the text regression loss, we apply regularization penalties to the visual projector $\mathcal{P}_{v}$ and text projector $\mathcal{P}_{t}$ to preserve object-centric information as much as possible.
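One way to picture such a penalty is a round-trip reconstruction term: project the object-centric tokens with $\mathcal{P}_{v}$, map them back with $\mathcal{P}_{t}$, and penalize the difference from the originals. The sketch below is an assumption made for illustration, motivated by the appendix remark that the text projector restores projected object-centric tokens. The exact penalty, the weight `reg_weight`, and the function names are hypothetical, and the projectors are passed in as plain callables standing in for the MLPs.

```python
def round_trip_penalty(tokens, visual_proj, text_proj):
    """Mean squared error between tokens and text_proj(visual_proj(tokens))."""
    total, count = 0.0, 0
    for token in tokens:
        restored = text_proj(visual_proj(token))
        total += sum((a - b) ** 2 for a, b in zip(token, restored))
        count += len(token)
    return total / count if count else 0.0


def pretraining_loss(text_loss, tokens, visual_proj, text_proj, reg_weight=1.0):
    """Text regression loss plus the (assumed) round-trip regularizer."""
    return text_loss + reg_weight * round_trip_penalty(tokens, visual_proj, text_proj)


# Stand-in projectors: a pair that inverts exactly, and a pair that does not.
def double(v):
    return [2.0 * x for x in v]


def half(v):
    return [0.5 * x for x in v]


def identity(v):
    return list(v)


object_tokens = [[1.0, 2.0]]
lossless = round_trip_penalty(object_tokens, double, half)
lossy = round_trip_penalty(object_tokens, double, identity)
```

A projector pair that inverts exactly incurs no penalty, while a pair that distorts the tokens is penalized in proportion to the squared reconstruction error.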

During instruction tuning, in addition to finetuning the visual projector and text projector, we use LoRA to finetune the LLM. Following prior work, besides the text regression loss, we apply cross-entropy loss and dice loss to supervise the segmentation mask decoded from the [SEG] token.
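For readers unfamiliar with this pair of mask losses, the following is a minimal sketch of a per-pixel binary cross-entropy term and a dice term computed on mask logits. It is not the training code: the loss weights `ce_weight` and `dice_weight`, the smoothing constant `eps`, and the function names are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_loss(logits, target):
    """Mean binary cross-entropy on logits, in a numerically stable form."""
    terms = [max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
             for x, t in zip(logits, target)]
    return sum(terms) / len(terms)


def dice_loss(logits, target, eps=1.0):
    """One minus the smoothed dice coefficient between predicted and target masks."""
    probs = [sigmoid(x) for x in logits]
    intersection = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(target) + eps)


def seg_loss(logits, target, ce_weight=1.0, dice_weight=1.0):
    """Weighted sum of the two mask losses for one flattened mask."""
    return ce_weight * bce_loss(logits, target) + dice_weight * dice_loss(logits, target)


# Flattened two-pixel masks: one confident correct prediction, one confident wrong one.
good = seg_loss([10.0, -10.0], [1, 0])
bad = seg_loss([-10.0, 10.0], [1, 0])
```

The cross-entropy term scores each pixel independently, whereas the dice term measures region overlap, which makes the combination less sensitive to the imbalance between foreground and background pixels.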

Testing. Image-level, object-level, and pixel-level understanding and reasoning tasks can all be expressed within the paradigm of Eq. 1. During the inference stage, we encode the necessary task inputs, such as text prompts, visual prompts, and image features, into tokens to input into the LLM. The output tokens of the LLM are then decoded into text responses and segmentation mask responses according to the task definition. We refer readers to the appendix for more details.

Experiment

Dataset setup. During the pretraining stage, we use the LLaVA pretraining dataset to perform visual-text alignment, following LLaVA. The instruction tuning process of OMG-LLaVA involves a diverse range of tasks and datasets. For image-level understanding and reasoning tasks, we use the LLaVA dataset, which includes 665K descriptions, reasoning, and conversation data. For object-level understanding and reasoning, we use the object-level description and conversation data from the Osprey dataset and the object-level point-prompt data from the MDVP dataset, which contain approximately 74K and 200K data, respectively. For pixel-level understanding and reasoning, we use the referring segmentation datasets, including refCOCO, refCOCO+, refCOCOg, and refClef, totaling 74K data. Additionally, semantic segmentation datasets, including ADE20k and COCO-stuff, totaling 26K data, and the grounded conversation generation dataset GranDf, containing 200K data, are used.

Implementation details. We use the pre-trained ConvNeXt-L OMG-Seg as the universal perception module and InternLM2-7B as the LLM for OMG-LLaVA. We adopt the xtuner codebase to build our model and data pipeline. The image is resized to 1024 $\times$ 1024. During the pretraining stage, only the visual projector and text projector are trained, with an initial learning rate of 1e-3. During the instruction tuning stage, the initial learning rate is set to 2e-4, only the perception model is kept frozen, and the LLM is fine-tuned using LoRA. The maximum sequence length in the LLM is set to 2,048. All training is conducted on four NVIDIA A800 GPUs with 80GB of memory. The pretraining stage and instruction tuning stage took 7 hours and 48 hours, respectively.

Comprehensive comparison with MLLMs. OMG-LLaVA is comprehensively compared with current MLLMs with perception capabilities, and the results are shown in Tab. 2. OMG-LLaVA demonstrates the most comprehensive capabilities. It achieves performance comparable to the SOTA in referring segmentation, grounded conversation generation, and region captioning. Additionally, OMG-LLaVA retains basic segmentation ability, enabling it to handle universal image and video segmentation tasks. Compared to other MLLMs, OMG-LLaVA features a simple and elegant system design, incorporating only a single visual encoder.

Referring expression segmentation. We evaluate OMG-LLaVA on refCOCO, refCOCO+, and refCOCOg, with the results shown in Tab. 3. OMG-LLaVA outperforms LISA by 1.5 cIoU, 3.2 cIoU, and 4.3 cIoU on the validation sets of refCOCO, refCOCO+, and refCOCOg, respectively, while keeping the OMG decoder frozen and using only a single visual encoder. When we unfreeze the OMG decoder and finetune OMG-LLaVA on the referring expression segmentation task, OMG-LLaVA achieves 78.0, 69.1, and 72.9 cIoU on refCOCO, refCOCO+, and refCOCOg, respectively, surpassing LISA by 3.1, 4.0, and 5.0 cIoU. Compared to PixelLM, OMG-LLaVA shows performance improvements of 5.0 cIoU and 3.6 cIoU on refCOCO and refCOCOg, respectively.
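Since the comparisons above are reported in cIoU, a short sketch of the metric may help. cIoU (cumulative IoU) divides the intersection summed over all samples by the union summed over all samples, which differs from averaging per-sample IoU. The code below is an illustrative implementation on flattened binary masks; the function names and the choice to return 0.0 for an empty union are assumptions, not taken from the evaluation code.

```python
def ciou(preds, targets):
    """Cumulative IoU: total intersection over total union across all samples."""
    intersection = union = 0
    for pred, target in zip(preds, targets):
        for p, t in zip(pred, target):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return intersection / union if union else 0.0


def mean_iou(preds, targets):
    """Average of per-sample IoU, shown only for contrast with cIoU."""
    scores = []
    for pred, target in zip(preds, targets):
        inter = sum(1 for p, t in zip(pred, target) if p and t)
        uni = sum(1 for p, t in zip(pred, target) if p or t)
        scores.append(inter / uni if uni else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Two flattened 4-pixel masks: a small object and a large object.
predictions = [[1, 1, 0, 0], [1, 0, 0, 0]]
ground_truth = [[1, 0, 0, 0], [1, 1, 1, 1]]
cumulative = ciou(predictions, ground_truth)
averaged = mean_iou(predictions, ground_truth)
```

Because the sums are taken over pixels before dividing, cIoU gives larger objects more influence than the per-sample average does, which is visible in the gap between the two values in this toy example.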

Grounded conversation generation. Grounded conversation generation is a comprehensive and complex task that involves both image-level and pixel-level understanding and reasoning. MLLMs need to provide fine-grained image descriptions and pixel-level understanding, linking the objects in the image captions to the corresponding segmentation masks. As shown in Tab. 4, when trained with comparable data, OMG-LLaVA surpasses LISA by 1.9 METEOR and 7.3 CIDEr in image description ability. In terms of pixel understanding, OMG-LLaVA also outperforms LISA by 4.7 AP50 and 3.5 mIoU, even though LISA uses SAM and finetunes its segmentation decoder. Although GLaMM uses much more training data than OMG-LLaVA, OMG-LLaVA demonstrates comparable pixel-understanding capabilities, outperforming GLaMM by 0.6 CIDEr, 1.4 AP50, and 0.1 mIoU on the test set.

Ablation and Analysis

Ablation study. We conduct ablation studies on referring expression segmentation and grounded conversation generation datasets, with all training and testing settings consistent with the main experiments. We use a simple combination of OMG-Seg and LLaVA as our baseline, similar to LISA, where the [SEG] tokens output by the LLM are input into OMG-Seg to obtain segmentation masks, with OMG-Seg kept frozen.

As shown in Tab. 5, the baseline performed poorly on the RES datasets. Similarly, it exhibited low segmentation quality on the GCG dataset. This is because the LLM did not acquire any segmentation priors and needed to generate segmentation queries based on image features and adapt them to the input of the frozen perception module, which is a challenging task. When using our proposed perception prior embedding strategy, OMG-LLaVA exhibits performance gains of 13.8 cIoU, 10.6 cIoU, and 11.7 cIoU on refCOCO, refCOCO+, and refCOCOg, respectively. Additionally, the perception prior embedding strategy also brings a performance improvement of 11.1 mIoU on the GCG dataset and a slight improvement in image description capability (0.4 METEOR). When foreground object queries were provided to the LLM, OMG-LLaVA further improved its performance by 1.9 cIoU on refCOCO and 1.5 mIoU on GCG.

We conducted a visualization analysis of the proposed strategies. As shown in the left part of Fig. 6, the simple baseline has poor capability in associating text and segmentation, which is the crucial reason for its poor performance on RES. When using our proposed perception prior embedding strategy, the object query and pixel features are explicitly integrated according to the perception prior, resulting in significantly enhanced text-segmentation association capability. By adopting the object query input strategy, the quality of some challenging segmentation cases, such as the lower right corner of the fence in Fig. 6, slightly improves.

Qualitative results. We provide visualization results of OMG-LLaVA on multiple image-level, object-level, and pixel-level tasks in Fig. 1. Additional qualitative results and visual comparisons for referring expression segmentation and grounded conversation generation are presented in the appendix.

Conclusion

We present a new MLLM, OMG-LLaVA, which bridges image-level, object-level, and pixel-level understanding and reasoning in one model. Our method contains only one image encoder, one LLM, and one decoder. With the proposed perception prior embedding and unified task instruction tuning, OMG-LLaVA can perform over 8 different multi-modal learning tasks while preserving the visual perception ability of the OMG-Seg baseline. Our method achieves results comparable to previous combined works with far fewer trainable parameters and lower computation costs. We hope our work can inspire the community to rethink the design of the MLLM meta-architecture to minimize the model components and maximize the MLLM’s functionalities.

Appendix

Overview. In this appendix, we will first give more implementation and training details of our method. Then, we present more detailed ablation studies on several component designs. Next, we present more detailed visualization results. In the end, we discuss the limitations and future work.

Pre-training. Following LLaVA, OMG-LLaVA first performs pre-training to learn the projector that projects visual tokens into the text space. During the pre-training stage, we freeze the visual encoder, OMG head, and LLM to train the visual projector for projecting visual tokens into the text space and to train the text projector for restoring the projected object-centric visual tokens to the segmentation embedding. The training data used in the pre-training stage is the same as that used in LLaVA. In this stage, the OMG-LLaVA is trained for 1 epoch. The batch size is 256, with 32 per GPU, and the learning rate is 0.001.

Supervised fine-tuning. During the instruction tuning stage, we freeze the visual encoder and OMG head, finetune the LLM using LoRA, and fully finetune the text and visual projectors. We train OMG-LLaVA for 1 epoch on all instruction tuning datasets, including the LLaVA instruction tuning dataset, referring expression segmentation datasets, semantic segmentation datasets, grounded conversation generation datasets, mask-based visual prompt datasets, and point-based visual prompt datasets. The batch size is 128, with 16 per GPU, and the learning rate is 2e-4.

Inference details for each task. OMG-LLaVA generates answers token by token during the inference stage based on the given question. We use a fixed template for the referring expression segmentation task to create the question: “Please segment {EXPRESSION} in this image.” In rare cases where OMG-LLaVA does not predict the [SEG] token, we use an empty mask as the segmentation result. We use a fixed question for the grounded conversation generation task: “Could you please give me a detailed description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer.” For other tasks, we remove special tokens, such as [SEG], from OMG-LLaVA’s responses to ensure the answers contain only text.
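The inference conventions above can be summarized in a short sketch. The two question strings are the ones quoted in the text. Everything else is an assumption made for illustration: the helper names, the representation of masks as nested lists, and the choice to strip only the [SEG] token.

```python
RES_TEMPLATE = "Please segment {EXPRESSION} in this image."
GCG_QUESTION = (
    "Could you please give me a detailed description of the image? "
    "Please respond with interleaved segmentation masks for the "
    "corresponding parts of the answer."
)


def build_res_question(expression):
    """Fill the fixed referring expression segmentation template."""
    return RES_TEMPLATE.format(EXPRESSION=expression)


def select_mask(response_text, decoded_masks, height, width):
    """Return the first decoded mask, or an empty mask if no [SEG] token was predicted."""
    if "[SEG]" not in response_text or not decoded_masks:
        return [[0] * width for _ in range(height)]
    return decoded_masks[0]


def strip_special_tokens(text, special_tokens=("[SEG]",)):
    """Remove special tokens and collapse the whitespace they leave behind."""
    for token in special_tokens:
        text = text.replace(token, "")
    return " ".join(text.split())


question = build_res_question("the smallest chair")
fallback = select_mask("Sure.", [], height=2, width=3)
clean = strip_special_tokens("A dog [SEG] on grass.")
```

Returning an empty mask when no [SEG] token appears keeps the evaluation well defined for every sample, at the cost of scoring such samples as misses.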

A.2 More Detailed Ablation Studies

Projector for object-centric visual tokens. We conducted ablation experiments on the vision projector. The results are shown in Tab. 6. We use a simple MLP projector as the baseline for object-centric visual tokens. When we added a cross-attention layer to the projector, performance on segmentation and visual prompt-based tasks decreased. This is because the introduction of the cross-attention layer caused the object-centric visual tokens to incorporate too many pixel-centric visual tokens, leading to interference with the object information. Furthermore, when the projector for object-centric visual tokens generated from visual prompt input and object queries is not shared, performance declines on segmentation and visual prompt-based tasks. Therefore, a shared MLP projector can effectively project object-centric visual tokens into the text space.

Answer format for segmentation-based tasks. In LISA, the response for the referring expression segmentation task is fixed as “Sure, it is [SEG].” However, this fixed answer may interfere with the instruction-following ability of the LLM, leading it to respond with “Sure, it is [SEG].” for new instructions. In GLaMM, for the grounded conversation generation task, the response is typically “Expression [SEG].” Since the “Expression” is flexible and variable, the LLM is less likely to overfit to a fixed response.

We conduct ablation experiments on the answer format for segmentation tasks, and the results are shown in Tab. 7. We find that unifying the answer format for segmentation tasks (including RES and GCG) as “Expression [SEG]” yields better performance. This more flexible answer format not only achieves better performance in the referring expression segmentation task compared to the fixed answer but also avoids damaging the LLM’s instruction-following ability.

Segmentation embeddings. We conduct ablation experiments on the generation strategy of segmentation embedding, and the results are shown in Tab. 8. We explore whether the hidden states of the intermediate layers corresponding to the [SEG] token are helpful for segmentation. Compared to using the hidden states of the last layer of the [SEG] token as the segmentation embedding, using the mean of the hidden states from all layers as the segmentation embedding resulted in negligible improvement on refCOCO but led to a significant performance drop on the more challenging refCOCOg. Concatenating the hidden states from all layers of the [SEG] token as the segmentation embedding resulted in a significant performance drop across all RES tasks. Therefore, the hidden state of the last layer already contains sufficient features to generate the segmentation mask, and introducing hidden states from other intermediate layers does not yield better segmentation results.
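The three strategies compared above differ only in how the per-layer hidden states of the [SEG] token are combined. The sketch below illustrates them on plain lists; the function name `seg_embedding` and the strategy labels are hypothetical, and the input is assumed to be one hidden vector per LLM layer, ordered from first to last.

```python
def seg_embedding(hidden_states, strategy="last"):
    """Combine the [SEG] token's per-layer hidden states into one embedding.

    hidden_states: list over layers, each a list of floats of equal length.
    strategy: "last" (final layer only), "mean" (average over layers),
              or "concat" (all layers joined end to end).
    """
    if strategy == "last":
        return list(hidden_states[-1])
    if strategy == "mean":
        n_layers = len(hidden_states)
        return [sum(column) / n_layers for column in zip(*hidden_states)]
    if strategy == "concat":
        return [value for layer in hidden_states for value in layer]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy hidden states for a 2-layer model with 2 channels.
states = [[1.0, 2.0], [3.0, 4.0]]
last = seg_embedding(states, "last")
mean = seg_embedding(states, "mean")
concat = seg_embedding(states, "concat")
```

Note that the concatenation variant changes the embedding width, so the text projector that follows it would need a correspondingly wider input layer.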

A.3 More Visualization Results

Qualitative comparison with SOTA methods. We conduct qualitative comparisons and analyses on various tasks, including referring expression segmentation, grounded conversation generation, and image-based conversation, against the SOTA methods LISA and GLaMM. Fig. 7 shows the visualization results of the RES task for LISA, GLaMM, and our proposed OMG-LLaVA. OMG-LLaVA demonstrates a more stable segmentation performance than LISA and GLaMM. Additionally, OMG-LLaVA exhibits better image and text understanding capabilities than LISA (13B) and GLaMM, as illustrated in the fourth column with the example of “the smallest chair”.

Fig. 8 shows the visualization results of the GCG task for GLaMM and OMG-LLaVA. Our proposed OMG-LLaVA provides more detailed and accurate descriptions of the scene, such as “lighthouse” and “bear.” Additionally, OMG-LLaVA demonstrates more stable segmentation capabilities, as seen for the “mountain” in the bottom-right image.

Fig. 9 shows the visualization results of the visual prompt-based description task for GLaMM and OMG-LLaVA. Compared to GLaMM, OMG-LLaVA supports more flexible visual prompts, including point, box, and mask prompts. Additionally, OMG-LLaVA generates more detailed object captions and demonstrates more accurate image understanding.

Fig. 10 shows the visualization results of the image-based conversation task for LISA, GLaMM, and OMG-LLaVA. Compared to LISA and GLaMM, OMG-LLaVA has stronger instruction-following ability. For example, when answering the question “What is the number on the jersey of the athlete squatting on the ground?”, both LISA and GLaMM incorrectly segmented “the jersey of the athlete squatting on the ground.” Compared to GLaMM, OMG-LLaVA can provide more detailed and accurate answers to user questions. Compared to LISA, OMG-LLaVA demonstrates stronger scene understanding and reasoning abilities. For instance, in question 3 of Fig. 10, LISA gave an utterly incorrect answer despite using a larger LLM (13B).

Visualization results of RES. We provide additional visualization results of OMG-LLaVA on the RES task in Fig. 11. OMG-LLaVA demonstrates a strong understanding of spatial relationships and human actions, enabling it to accurately and reliably segment the specified objects based on these descriptions. Furthermore, even without training on any reasoning segmentation data, OMG-LLaVA exhibits the ability to perform reasoning segmentation. As shown in Fig. 12, OMG-LLaVA can infer the target based on the question and accurately segment the corresponding object.

Visualization results of GCG. As depicted in Fig. 13, our method performs well on the grounded conversation generation task. OMG-LLaVA demonstrates strong scene understanding and object segmentation capabilities. Some objects are overlooked, which we attribute to the omission of many objects in the image captions of the GranDf dataset. We believe that training on higher-quality data would further improve the performance of OMG-LLaVA.

Visualization results of visual prompts-based description. Fig. 14 shows more visualization results for the visual prompt-based description task. OMG-LLaVA supports input of point, box, and mask-based visual prompts and provides detailed descriptions. These descriptions include information about the objects and their relationships with other objects in the scene.

A.4 Limitation and Future Work Discussion

Limitations of OMG-LLaVA. Although OMG-LLaVA achieves image-level, object-level, and pixel-level capabilities with a concise and elegant architecture, much room still exists for improvement. Firstly, joint training with pixel-level understanding data often leads to decreased image-level capability, a phenomenon widely observed in LISA and GLaMM. This challenge could be addressed by organizing the training data to eliminate this conflict. Secondly, because OMG-Seg lacks multi-granularity segmentation capability, OMG-LLaVA cannot perform part-level segmentation. This challenge could be addressed by adopting a more powerful and universal perception module and adding part-level visual inputs.

Future Works. Several future directions can be explored with our new meta-architecture. We list two potential directions: video and more instruction-tuning data. Although OMG-Seg can accept video inputs, OMG-LLaVA still cannot perform pixel-level spatial-temporal reasoning, owing to the lack of such datasets. Moreover, more instruction-tuning data involving more localization outputs and multi-round conversations can be used to build a stronger MLLM. For example, we plan to use the full GLaMM datasets and more detection datasets for joint co-training as future work if more computation resources are available.