Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei

Introduction

Multimodal Large Language Models (MLLMs) have successfully played a role as a general-purpose interface across a wide range of tasks, such as language, vision, and vision-language tasks. MLLMs can perceive general modalities, including texts, images, and audio, and generate responses using free-form texts under zero-shot and few-shot settings.

In this work, we unlock the grounding capability for multimodal large language models. Grounding capability can provide a more convenient and efficient human-AI interaction for vision-language tasks. It enables the user to point to an object or region in the image directly rather than typing a detailed text description to refer to it, and the model can understand that image region through its spatial location. Grounding capability also enables the model to respond with visual answers (i.e., bounding boxes), which can support more vision-language tasks such as referring expression comprehension. Visual answers are more accurate and resolve the coreference ambiguity compared with text-only responses. In addition, grounding capability can link noun phrases and referring expressions in the generated free-form text response to the image regions, providing more accurate, informative, and comprehensive answers.

We introduce Kosmos-2, a multimodal large language model with grounding capability built upon Kosmos-1. Kosmos-2 is a Transformer-based causal language model and is trained using the next-word prediction task. In order to unlock the grounding capability, we construct a web-scale dataset of grounded image-text pairs, and combine it with the multimodal corpora in Kosmos-1 to train the model. The grounded image-text pairs are built upon a subset of image-text pairs from LAION-2B and COYO-700M. We construct a pipeline to extract and link the text spans (i.e., noun phrases and referring expressions) in the caption to the spatial locations (e.g., bounding boxes) of their corresponding objects or regions in the image. We convert the spatial coordinates of the bounding boxes to a sequence of location tokens, which is then appended after the respective text span. The data format serves as a “hyperlink” to connect the objects or regions of the image to the caption.

Experimental results demonstrate that Kosmos-2 not only achieves competitive performance on language and vision-language tasks evaluated in Kosmos-1, but also achieves impressive performance on grounding tasks (phrase grounding and referring expression comprehension) and referring tasks (referring expression generation). As shown in Figure 2, integrating the grounding capability enables Kosmos-2 to be used for more downstream tasks, such as grounded image captioning, and grounded visual question answering.

Construction of Web-Scale Grounded Image-Text Pairs (GrIT)

We introduce GrIT, a large-scale dataset of Grounded Image-Text pairs, which is created based on image-text pairs from a subset of COYO-700M and LAION-2B (a subset of GrIT can be downloaded at https://aka.ms/kosmos-2). We construct a pipeline to extract and link text spans (i.e., noun phrases and referring expressions) in the caption to their corresponding image regions. The pipeline mainly consists of two steps: generating noun-chunk-bounding-box pairs and producing referring-expression-bounding-box pairs. We describe these steps in detail below:

Given an image-text pair, we first extract noun chunks from the caption and associate them with image regions using a pretrained detector. As illustrated in Figure 3, we use spaCy to parse the caption (“a dog in a field of flowers”) and extract all noun chunks (“a dog”, “a field” and “flowers”). We eliminate certain abstract noun phrases that are challenging to recognize in the image, such as “time”, “love”, and “freedom”, to reduce potential noise. Subsequently, we input the image and noun chunks extracted from the caption into a pretrained grounding model (e.g., GLIP) to obtain the associated bounding boxes. The non-maximum suppression algorithm is applied to remove bounding boxes that have a high overlap with others, even if they are not for the same noun chunk. We keep noun-chunk-bounding-box pairs with predicted confidence scores higher than 0.65. If no bounding boxes are retained, we discard the corresponding image-caption pair.
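The filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the detections are made up, and the NMS overlap threshold of 0.5 is an assumed value, since the text only gives the 0.65 confidence threshold.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
Detection = Tuple[str, Box, float]        # (noun chunk, box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets: List[Detection],
                      nms_iou: float = 0.5,      # assumed, not given in the text
                      min_score: float = 0.65) -> List[Detection]:
    """Chunk-agnostic NMS followed by the confidence threshold."""
    kept: List[Detection] = []
    for det in sorted(dets, key=lambda d: d[2], reverse=True):
        # Suppress a box that overlaps any already-kept box,
        # even when the two boxes belong to different noun chunks.
        if all(iou(det[1], k[1]) <= nms_iou for k in kept):
            kept.append(det)
    return [d for d in kept if d[2] > min_score]


# Made-up detections for the caption "a dog in a field of flowers".
detections = [
    ("a dog", (10, 10, 100, 100), 0.90),
    ("a field", (12, 12, 98, 102), 0.80),     # nearly the same box as "a dog"
    ("flowers", (150, 150, 200, 200), 0.70),
    ("a field", (0, 120, 220, 220), 0.50),    # survives NMS, fails the score cut
]
pairs = filter_detections(detections)
keep_image = len(pairs) > 0   # the image-caption pair is discarded otherwise
```

In this toy input the second box is suppressed by the first despite naming a different chunk, and the last one is dropped by the confidence cut.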

In order to endow the model with the ability to ground complex linguistic descriptions, we expand noun chunks to referring expressions. Specifically, we use spaCy to obtain dependency relations of the sentence. We then expand a noun chunk into a referring expression by recursively traversing its children in the dependency tree and concatenating children tokens with the noun chunk. We do not expand noun chunks with conjuncts. For noun chunks without children tokens, we keep them for the next process. In the example shown in Figure 3, the noun chunk “a dog” can be expanded to “a dog in a field of flowers”, and the noun chunk “a field” can be expanded to “a field of flowers”.
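As a rough illustration of this expansion, the sketch below walks a hand-written dependency tree instead of calling spaCy, so it runs without a language model. The head indices are an assumption about how a parser might analyse the caption (with the root pointing to itself), the helper names are hypothetical, and the rule about conjuncts is not implemented.

```python
from typing import Dict, List


def subtree(root: int, children: Dict[int, List[int]]) -> List[int]:
    """Collect a token and, recursively, all of its dependents."""
    idxs = [root]
    for child in children.get(root, []):
        idxs.extend(subtree(child, children))
    return idxs


def expand_chunk(tokens: List[str], heads: List[int], chunk_root: int) -> str:
    """Expand the noun chunk headed by `chunk_root` to a referring expression."""
    children: Dict[int, List[int]] = {}
    for i, h in enumerate(heads):
        if h != i:                      # the sentence root is its own head
            children.setdefault(h, []).append(i)
    # Restore surface order before joining the tokens.
    return " ".join(tokens[i] for i in sorted(subtree(chunk_root, children)))


tokens = ["a", "dog", "in", "a", "field", "of", "flowers"]
# Assumed parse: a->dog, dog=root, in->dog, a->field, field->in, of->field, flowers->of
heads = [1, 1, 1, 4, 2, 4, 5]

dog_expr = expand_chunk(tokens, heads, chunk_root=1)
field_expr = expand_chunk(tokens, heads, chunk_root=4)
flowers_expr = expand_chunk(tokens, heads, chunk_root=6)
```

With this parse, the chunk headed by “dog” grows to the full caption, the chunk headed by “field” grows to “a field of flowers”, and “flowers” has no children and stays unchanged.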

Furthermore, we only retain referring expressions or noun chunks that are not contained by others. As shown in Figure 3, we keep the referring expression “a dog in a field of flowers” and drop “a field of flowers” (as it is entailed by “a dog in a field of flowers”) and “flowers”. We assign the bounding box of the noun chunk (“a dog”) to the corresponding generated referring expression (“a dog in a field of flowers”).
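A minimal sketch of this retention step follows. It assumes that containment is tested on token spans of the caption, which is one plausible reading of "contained by others"; the function name and the box coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]                    # [start, end) token offsets in the caption
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def retain_maximal(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not contained in a different span."""
    kept = []
    for s in spans:
        contained = any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
        if not contained:
            kept.append(s)
    return kept


tokens = "a dog in a field of flowers".split()

# Expanded span -> bounding box of the noun chunk it was grown from (made-up boxes).
candidates: Dict[Span, Box] = {
    (0, 7): (30, 40, 120, 200),    # "a dog in a field of flowers", box of "a dog"
    (3, 7): (0, 100, 224, 224),    # "a field of flowers", box of "a field"
    (6, 7): (60, 150, 210, 220),   # "flowers"
}

kept = retain_maximal(list(candidates))
# Each retained expression keeps the box of its source noun chunk.
grounded = [(" ".join(tokens[s:e]), candidates[(s, e)]) for s, e in kept]
```

Only the longest expression survives, and it carries the box that the detector produced for “a dog”.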

In the end, we obtain approximately 91M images, 115M text spans, and 137M associated bounding boxes. We compare GrIT with existing publicly accessible visual grounding datasets in Table 1. Data samples of GrIT are shown in the Appendix.

Kosmos-2: A Grounded Multimodal Large Language Model

Kosmos-2 is a grounded multimodal large language model, which integrates grounding and referring capabilities compared with Kosmos-1. The model can accept image regions selected by the user using bounding boxes as input, provide visual answers (i.e., bounding boxes), and ground the text output to the visual world. Kosmos-2 adopts the same model architecture and training objective as Kosmos-1. We add grounded image-text pairs into the training data to endow the model with grounding and referring capabilities. For a text span (such as a noun phrase or referring expression) and its corresponding bounding boxes in a grounded image-text pair, we discretize the continuous coordinates of the bounding boxes into a sequence of location tokens so that they can be encoded together with text tokens in a unified way. Then we link the location tokens and their corresponding text span via a “hyperlink” data format. The model is trained to establish a mapping between image regions and their corresponding location tokens and connect the image regions with their associated text spans.

Given a text span and its associated bounding boxes in a grounded image-text pair, we first convert the continuous coordinates of bounding boxes into a sequence of discrete location tokens. For an image with width $W$ and height $H$, we evenly divide both the width and height into $P$ segments each. $P\times P$ bins are obtained and each bin consists of $(W/P)\times(H/P)$ pixels. For each bin, we use a location token to represent the coordinates within that bin. We use the coordinates of the center pixel of each bin to determine bounding boxes on the image. In total, $P\times P$ location tokens are introduced, and these tokens are added to the word vocabulary to enable unified modeling with texts.

The bounding box can be represented using its top-left point ($x_{1}$, $y_{1}$) and bottom-right point ($x_{2}$, $y_{2}$). We discretize the top-left and bottom-right corner points to location tokens, respectively. We concatenate the top-left location token <loc_1>, the bottom-right location token <loc_2>, and special boundary tokens <box> and </box>, to represent a single bounding box: “<box><loc_1><loc_2></box>”. If the text span is associated with multiple bounding boxes, we use a special token <delim> to concatenate the location tokens of these bounding boxes: “<box><loc_1^i><loc_2^i><delim>…<loc_1^j><loc_2^j></box>”.
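The discretization can be sketched as follows. The row-major numbering of bins (index = row × P + column) is an assumption of this sketch, as the text does not state how bins are ordered, and the function names are hypothetical.

```python
from typing import Tuple


def point_to_index(x: float, y: float, width: int, height: int, P: int = 32) -> int:
    """Map a pixel coordinate to the index of the bin that contains it."""
    col = min(int(x * P / width), P - 1)    # clamp points on the right edge
    row = min(int(y * P / height), P - 1)   # clamp points on the bottom edge
    return row * P + col                    # assumed row-major ordering


def index_to_point(idx: int, width: int, height: int, P: int = 32) -> Tuple[float, float]:
    """Map a bin index back to the center of that bin, in pixels."""
    row, col = divmod(idx, P)
    return ((col + 0.5) * width / P, (row + 0.5) * height / P)


def box_to_indices(box: Tuple[float, float, float, float],
                   width: int, height: int, P: int = 32) -> Tuple[int, int]:
    """Discretize the top-left and bottom-right corners of a box."""
    x1, y1, x2, y2 = box
    return (point_to_index(x1, y1, width, height, P),
            point_to_index(x2, y2, width, height, P))


top_left, bottom_right = box_to_indices((10, 20, 224, 224), width=224, height=224)
recovered = index_to_point(top_left, width=224, height=224)
```

Decoding returns the bin center rather than the original coordinate, so the round trip is lossy by up to half a bin in each direction.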

Then we arrange the text span and its associated location tokens in a format resembling a “hyperlink” in markdown. For the text span with a single bounding box, the resulting sequence is “<p> text span </p><box><loc_1><loc_2></box>”, where <p> and </p> are special tokens indicating the beginning and end of the text span. The data format tells the model that image regions within the bounding box are associated with the text span.
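A small sketch of how such a “hyperlink” string could be assembled is shown below. The literal token spellings (`<p>`, `</p>`, `<box>`, `</box>`, `<delim>`, `<loc_k>`) and the function name are assumptions of this sketch; the actual tokenizer may register them differently.

```python
from typing import List, Tuple


def ground_span(text: str, boxes: List[Tuple[int, int]]) -> str:
    """Render a text span followed by its box(es) as location tokens.

    `boxes` holds (top_left_index, bottom_right_index) pairs, one per box.
    """
    if not boxes:
        raise ValueError("a grounded span needs at least one bounding box")
    # Several boxes for one span are joined with the delimiter token.
    locs = "<delim>".join(f"<loc_{tl}><loc_{br}>" for tl, br in boxes)
    return f"<p> {text} </p><box>{locs}</box>"


single = ground_span("a campfire", [(4, 1007)])
multi = ground_span("two dogs", [(1, 2), (3, 4)])
```

The span text plays the role of the link text in a markdown hyperlink, and the box tokens play the role of the link target.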

For the example shown in Figure 1, the input representation is:

<s> <image> Image Embedding </image> <grounding> <p> It </p><box><loc_44><loc_863></box> seats next to <p> a campfire </p><box><loc_4><loc_1007></box> </s>

where <s> and </s> indicate start- and end-of-sequence, and <image> and </image> represent the beginning and end of encoded image embeddings. <grounding> is a special token that tells the model to ground the text output to the visual world. We map input text tokens and location tokens to embeddings via a lookup table. Following Kosmos-1, a vision encoder and a resampler module are used to obtain image embeddings for input images.

For language-only data, cross-modal paired data (i.e., image-text pairs), and interleaved multimodal data, we use the same input representations as Kosmos-1.

Grounded Multimodal Large Language Models

Based on Kosmos-1, Kosmos-2 enhances multimodal large language models by incorporating grounding and referring capabilities. Kosmos-2 also uses a Transformer-based causal language model as the backbone and is trained with the next-token prediction task.

In addition to multimodal corpora used in Kosmos-1 (including text corpora, image-caption pairs, and interleaved image-text data), we add grounded image-text pairs into training. The training loss only considers discrete tokens, such as text tokens and location tokens. The model can learn to locate and understand image regions by their location tokens and the whole image, associate text spans to image regions, and output bounding boxes of the image region using location tokens.

Kosmos-2 shows new capabilities of grounding and referring. The referring capability enables us to point out image regions with bounding boxes. Kosmos-2 can understand the image regions users refer to by the coordinates of bounding boxes. The referring capability provides a new interaction method. Different from previous MLLMs, which can only provide text output, Kosmos-2 can provide visual answers (i.e., bounding boxes) and ground text output to the image. The grounding capability enables the model to provide more accurate, informative, and comprehensive responses. In addition to vision, language, and vision-language tasks evaluated in Kosmos-1, the model can be used for more downstream tasks, such as grounded image captioning, grounded VQA, and referring expression comprehension and generation.

Model Training

We train the model on newly added grounded image-text pairs, monomodal text corpora, image-caption pairs, and interleaved image-text data. Our training process involves a batch size of 419K tokens, consisting of 185K tokens from text corpora, 215K tokens from original and grounded image-caption pairs, and 19K tokens from interleaved data. We train Kosmos-2 for 60k steps, equivalent to around 25 billion tokens. The AdamW optimizer is employed with $\beta=(0.9,0.98)$. We set the weight decay to 0.01 and the dropout rate to 0.1. The learning rate increases to 2e-4 during the first 375 warm-up steps and linearly decays to zero. We train the model on 256 V100 GPUs and the training takes approximately one day to complete. In order to tell the model when to ground text output to the visual world, we prepend the ‘<grounding>’ token to the grounded caption during training.
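The token budget and the learning-rate schedule above can be checked with a short sketch. Two points are assumptions: “K” is read as 1,000 tokens, and the warm-up is taken to be linear from zero, since the text only says the rate “increases” during warm-up. The function name is hypothetical.

```python
PEAK_LR = 2e-4
WARMUP_STEPS = 375
TOTAL_STEPS = 60_000


def learning_rate(step: int) -> float:
    """Warm-up to the peak rate, then linear decay to zero at the final step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS          # assumed linear warm-up
    remaining = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, remaining)


# Per-batch token mix reported in the text (K read as 1,000).
tokens_per_batch = 185_000 + 215_000 + 19_000   # text + image-caption + interleaved
total_tokens = tokens_per_batch * TOTAL_STEPS   # about 25 billion
```

The product of batch size and step count gives 25.14 billion tokens, consistent with the “around 25 billion” stated above.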

Following Kosmos-1, the vision encoder has 24 layers with 1,024 hidden size and 4,096 FFN intermediate size. The multimodal large language model component is a 24-layer Magneto Transformer with 2,048 hidden dimensions, 32 attention heads, and 8,192 FFN intermediate size. The total number of trainable parameters amounts to approximately 1.6B. The image resolution is set to 224 $\times$ 224 and the patch size is 14 $\times$ 14. We divide the width and height of the image into 32 bins, with each bin consisting of 7 $\times$ 7 pixels. A total of 32 $\times$ 32 location tokens are added to the vocabulary. Kosmos-2 uses the weights of Kosmos-1 for initialization, while the newly added word embeddings of location tokens are initialized randomly. We update all the parameters during training and instruction tuning.
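As a quick check of the spatial granularity implied by these settings (the patch count is derived here and is not stated in the text):

$$
\frac{224}{32} = 7 \ \text{pixels per bin side}, \qquad 32 \times 32 = 1024 \ \text{location tokens}, \qquad \left(\frac{224}{14}\right)^{2} = 16^{2} = 256 \ \text{image patches}.
$$

Each location bin thus covers $7 \times 7 = 49$ pixels, one quarter of the area of a $14 \times 14$ patch.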

After the model is trained, we perform instruction tuning to better align Kosmos-2 with human instructions. We combine a vision-language instruction dataset (i.e., LLaVA-Instruct) and language-only instruction datasets (i.e., Unnatural Instructions and FLANv2) with the training data to tune the model. In addition, we construct grounded instruction data by utilizing the pairs of bounding boxes and expressions (i.e., noun phrases and referring expressions) in GrIT. Given an expression-bounding-box pair, we use “<p> expression </p>” as the input instruction, and prompt the model to generate the corresponding location tokens of the bounding boxes. We also use prompts like “<p> It </p><box><loc_1><loc_2></box> is” to ask the model to generate expressions according to its bounding boxes. Table 9 in Appendix B presents more templates.

Evaluation

We first evaluate Kosmos-2 on multimodal grounding and multimodal referring tasks to assess the new capabilities, and then test the model on language and perception-language tasks evaluated in Kosmos-1.

In order to evaluate the ability of multimodal grounding, we test Kosmos-2 on widely used phrase grounding and referring expression comprehension tasks in a generation manner. The phrase grounding task requires the model to predict a set of bounding boxes based on one or more given phrases that may be interrelated within a single caption. The referring expression comprehension task requires the model to locate the object described by a textual referring expression within a given image.

By testing Kosmos-2 on these two tasks, we can assess how well the model performs in grounding text descriptions to the visual world, which is crucial for developing advanced AI systems capable of handling complex multimodal tasks.

For both phrase grounding and referring expression comprehension tasks, Kosmos-2 is required to generate location tokens which are then converted to bounding boxes for evaluation. The input format is “<s><image> Image Embedding </image><grounding>…”, where “<grounding>” is used to prompt the model to generate location tokens.

We evaluate the phrase grounding task on the Flickr30k Entities val and test splits. In order to reduce ambiguity, we do not prompt the model with individual phrases; instead, we use the current phrase along with the preceding words as input, where the preceding words serve as context: “ … <p>{phrase}</p>”. For the example shown in Figure 4(1), the model needs to predict the locations of the phrases “A man”, “a blue hard hat”, “orange safety vest” and “an intersection” in the caption “A man in a blue hard hat and orange safety vest stands in an intersection.”. To generate the location tokens for the phrase “A man”, which is the beginning of the caption, the prompt is “<p>A man</p>”. For the phrase “orange safety vest”, the prompt is “A man in a blue hard hat and <p>orange safety vest</p>”. When multiple men are in the image, the context “A man in a blue hard hat and” explicitly helps the model locate the object to reduce ambiguity.
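The prompt construction just described can be sketched as below. The function name is hypothetical, the `<p>`/`</p>` spellings are assumptions of this sketch, and the image prefix that precedes the text in the full input is omitted.

```python
def phrase_grounding_prompt(caption: str, phrase: str, start: int) -> str:
    """Build the text prompt for one phrase: preceding words, then the tagged phrase.

    `start` is the character offset of the phrase inside the caption, so that
    repeated phrases can be told apart.
    """
    if caption[start:start + len(phrase)] != phrase:
        raise ValueError("phrase does not occur at the given offset")
    context = caption[:start]            # everything before the phrase
    return f"{context}<p>{phrase}</p>"   # words after the phrase are dropped


caption = "A man in a blue hard hat and orange safety vest stands in an intersection."

first = phrase_grounding_prompt(caption, "A man", 0)
vest = phrase_grounding_prompt(
    caption, "orange safety vest", caption.index("orange safety vest"))
```

The model is then expected to continue the prompt with the box tokens for the tagged phrase.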
We obtain the location tokens in “<box>...</box>” from the model response and then convert them into bounding boxes. The generated bounding box is correct if its intersection over union (IoU) with the ground-truth bounding box is greater than 0.5. If Kosmos-2 generates a location sequence that cannot be converted correctly (e.g., “<box><loc_1></box>”), we treat it as a negative sample. We use the ANY-BOX protocol in MDETR. We report the R@1, R@5, and R@10 metrics, where R@1/5/10 means calculating the recall using the top 1/5/10 generated bounding boxes. If there are fewer than 5 or 10 bounding boxes generated by Kosmos-2, we use all available bounding boxes for the calculation.
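The scoring procedure can be sketched as follows. Several details are assumptions of this sketch: the token spellings, the row-major bin numbering, decoding each token to its bin center, and the reading of ANY-BOX as "a prediction counts if it overlaps any ground-truth box of the phrase". All function names are hypothetical.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def bin_center(idx: int, width: int, height: int, P: int) -> Tuple[float, float]:
    row, col = divmod(idx, P)             # assumed row-major bin numbering
    return ((col + 0.5) * width / P, (row + 0.5) * height / P)


def decode_boxes(response: str, width: int, height: int, P: int = 32) -> List[Box]:
    """Extract boxes from a model response; malformed groups yield no box."""
    boxes: List[Box] = []
    for body in re.findall(r"<box>(.*?)</box>", response):
        for part in body.split("<delim>"):
            ids = [int(i) for i in re.findall(r"<loc_(\d+)>", part)]
            if len(ids) != 2 or max(ids) >= P * P:
                continue                  # cannot be converted: negative sample
            x1, y1 = bin_center(ids[0], width, height, P)
            x2, y2 = bin_center(ids[1], width, height, P)
            if x2 > x1 and y2 > y1:
                boxes.append((x1, y1, x2, y2))
    return boxes


def hit_at_k(pred: List[Box], gts: List[Box], k: int, thresh: float = 0.5) -> bool:
    """True if any of the top-k predictions matches any ground-truth box."""
    return any(iou(p, g) > thresh for p in pred[:k] for g in gts)


pred = decode_boxes("<p>A man</p><box><loc_65><loc_1023></box>", 224, 224)
bad = decode_boxes("<box><loc_1></box>", 224, 224)
```

R@k over a split is then the mean of `hit_at_k` across all phrases; slicing with `pred[:k]` naturally uses all available boxes when fewer than k were generated.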
Table 2 presents results on the Flickr30k Entities val and test splits. Kosmos-2 achieves impressive zero-shot performance and outperforms GRILL, which relies on an attached detector, by a large margin. Moreover, our model outperforms the traditional finetuned VisualBert model by 7.4% R@1 on both val and test splits. In contrast to other models, Kosmos-2 does not involve prior designs (e.g., object queries or proposals), leading to similar results among R@1, R@5, and R@10. These results demonstrate that Kosmos-2 can generate high-quality locations without the need for post-processing redundant locations. This capability highlights the effectiveness of our model in handling phrase grounding tasks.
Referring Expression Comprehension
We assess the referring expression comprehension task using three well-established datasets: RefCOCO, RefCOCO+ and RefCOCOg. Both RefCOCO and RefCOCO+ were generated through a two-player game, with RefCOCO+ specifically designed to exclude spatial relations, such as “on the left”. RefCOCOg incorporates spatial relations and features longer expressions on average. Different from phrase grounding on Flickr30k Entities, we measure this task by using the referring expression as the input: “<p> referring expression </p>”. For the example shown in Figure 4(2), the input sequence is “<p>A man in a blue hard hat and orange safety vest</p>”. Similarly, the predicted bounding box is considered correct only if its IoU with the ground-truth bounding box is greater than 0.5. A sequence that fails to decode is also treated as a negative sample. We use the first generated bounding box for the query expression to measure the accuracy.
Table 3 reports referring expression comprehension results on RefCOCO, RefCOCO+ and RefCOCOg. Kosmos-2 also obtains promising zero-shot performance on the comprehension task, significantly outperforming previous zero-shot models on the RefCOCOg benchmark. However, compared to previous finetuned works, Kosmos-2 achieves slightly lower performance on RefCOCO and RefCOCO+ than on RefCOCOg. This discrepancy can be attributed to the data distribution present in RefCOCO and RefCOCO+, where players tend to use shorter referring expressions (e.g., “left bottom”) during the two-player game. Hence, one of our future goals is to enhance MLLMs’ ability to accurately understand more types of human expressions.
Multimodal Referring
In addition to multimodal grounding tasks, we evaluate the model’s ability to understand image regions or objects that users refer to by inputting bounding boxes. Compared with previous multimodal LLMs, for which users can refer to image regions or objects only via detailed text descriptions, directly referring to image regions using their bounding boxes is more effective and reduces ambiguity.
We evaluate the model on the referring expression generation task, which aims to generate unambiguous text descriptions of specific objects or regions within the bounding box. We employ the widely used RefCOCOg dataset to evaluate the model’s performance under both zero-shot and few-shot settings, showcasing its adaptability in different scenarios.
The model is tasked with generating an associated text description for an object or region given the location tokens of its bounding boxes (e.g., “<box><loc_1><loc_2></box>”). Benefiting from the unified input format, we use “<p> It </p><box><loc_1><loc_2></box> is” as the prompt to encourage the model to predict its text description. Figure 5 (1) and (2) demonstrate the input format for zero-shot and few-shot referring expression generation, respectively. Following previous works, we report results using METEOR and CIDEr metrics. The image resolution is 224 $\times$ 224. Greedy search is used for decoding.
Results
Table 4 presents the zero-shot and few-shot results of referring expression generation on RefCOCOg. We compare Kosmos-2 with a finetuned listener-speaker model, which introduces an added reward-based module (SLR). Our model obtains impressive zero-shot performance on referring expression generation, and even outperforms finetuned SLR by 1.1 CIDEr points. Moreover, when prompted with few-shot demonstrations, Kosmos-2 shows further improvements, highlighting its in-context learning ability.
Perception-Language Tasks
In addition to multimodal grounding and referring tasks, we also evaluate Kosmos-2 on the vision-language tasks following Kosmos-1. In particular, we perform zero-shot evaluations on two popular tasks, including image captioning and visual question answering. Image captioning requires the model to generate a text description of the given image, whereas visual question answering seeks to answer a natural language question based on an image. In order to have a fair comparison with Kosmos-1, we report results without instruction tuning.
For image captioning, we evaluate the model on the widely used Flickr30k Karpathy split test set. We employ beam search for caption generation, with a beam size of 5. We report results using the CIDEr metric evaluated by COCOEvalCap (https://github.com/salaniz/pycocoevalcap). We use the prompt “An image of” to generate the image description.
For visual question answering, we evaluate zero-shot performance on the test-dev set of VQAv2. Greedy search is used for decoding. We report VQA scores obtained from the VQAv2 evaluation server (https://eval.ai/challenge/830/overview). “Question: {question} Answer: {answer}” is used as the prompt for the dataset. The image resolution is 224 $\times$ 224 for both tasks.
Results
We present the zero-shot performance on Flickr30k and VQAv2 in Table 5. Kosmos-2 exhibits comparable overall performance to Kosmos-1, showing a slight improvement on Flickr30k while experiencing a marginal decrease on VQA. While Kosmos-2 introduces new capabilities of grounding and referring, the model still achieves competitive performance on perception-language tasks.
Language Tasks
We evaluate Kosmos-2 on eight language tasks: cloze and completion tasks (StoryCloze, HellaSwag), Winograd-style tasks (Winograd, Winogrande), commonsense reasoning (PIQA), and three SuperGLUE benchmark datasets (BoolQ, CB, and COPA). We report the zero-shot results in Table 6. Compared with Kosmos-1, Kosmos-2 achieves similar performance on StoryCloze, HellaSwag, Winograd, Winogrande, and PIQA, experiences a decrease in performance on CB, but shows improvement on BoolQ and COPA. In summary, Kosmos-2 acquires new capabilities while maintaining comparable performance on language tasks. This illustrates the potential of the model in balancing and expanding its skills across different domains.
Conclusion
We present Kosmos-2, a multimodal large language model that can ground to the visual world. Specifically, we pre-train Kosmos-2 by augmenting the multimodal corpora used in Kosmos-1 with GrIT, a large-scale dataset of Grounded Image-Text pairs, which is created by extracting and associating noun phrases and referring expressions in the caption to the objects or regions in the scene. Kosmos-2 enables new capabilities of perceiving image regions and grounding text output to the visual world, which makes grounding a foundation capability of MLLMs in many downstream applications. Experimental results demonstrate that Kosmos-2 achieves impressive results on language and vision-language tasks evaluated in Kosmos-1, grounding tasks including phrase grounding and referring expression comprehension, and referring tasks such as referring expression generation.
Acknowledgement
Some examples (such as Figure 1) are taken from the WHOOPS corpus.
Ethics Statement
The model presented in this paper is intended for academic and research purposes. The utilization of the model to create unsuitable material is strictly forbidden and not endorsed by this work. The accountability for any improper or unacceptable application of the model rests exclusively with the individuals who generated such content. We also put Microsoft AI Principles (https://www.microsoft.com/ai/responsible-ai) into practice when developing the models.
References
Appendix A Hyperparameters
The training hyperparameters of Kosmos-2 are listed in Table 7.
The instruction tuning hyperparameters are listed in Table 8.
Appendix B Templates for Grounded Instruction Data
Table 9 presents the instruction templates of expression generation based on its associated bounding boxes during instruction tuning.
Appendix C Examples of GrIT
We present some examples of the GrIT corpus in Figures 6, 7, 8 and 9. The grounded image-text pairs span over various domains and contain different numbers of objects.
Appendix D More Examples of Kosmos-2
As illustrated in Figure 10, multimodal referring capability used for visual dialogue can unlock potential in human-AI interaction. In Figure 11, our approach demonstrates its in-context learning ability for fine-grained object detection using both text and image descriptions. Figure 12 and Figure 13 showcase more selected examples, including grounded visual question answering, grounded image captioning, and multimodal referring.