CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

cs.CV

Introduction

Vision language models are versatile and powerful. Many vision and cross-modality tasks can be formulated as next token prediction, e.g., image captioning (Agrawal et al., 2019), visual question answering (Antol et al., 2015), visual grounding (Yu et al., 2016) and even segmentation (Chen et al., 2022a). Useful abilities like in-context learning (Tsimpoukelli et al., 2021; Sun et al., 2023a; Alayrac et al., 2022) also emerge along with the improvement of downstream tasks when scaling up VLMs. However, to train a large language model is already non-trivial, and it is more challenging to train a VLM from scratch with the same NLP performance as well-trained pure language models like LLaMA2 (Touvron et al., 2023). Therefore, it is natural to investigate how to train a VLM from an off-the-shelf pretrained language model.

The popular shallow alignment methods represented by InstructBLIP (Li et al., 2023b) and MiniGPT-4 (Zhu et al., 2023) connect a frozen pretrained vision encoder and language model via a trainable Q-Former or a linear layer, mapping the image features into the input embedding space of the language model. This method converges rapidly, but its performance is noticeably inferior to that of LLaVA-1.5 with trainable language parameters, despite their model sizes and training datasets being almost identical.

The primary challenge in the performance of shallow alignment methods within VLMs can be attributed to the lack of deep fusion between visual and linguistic data. Shallow alignment methods struggle because they rely on ‘frozen’ language model weights, which are intrinsically trained to process text tokens. This presents a significant mismatch issue, as visual features lack a direct equivalent in the textual input space. Consequently, when these visual features undergo multi-layer transformations, they tend to deviate from the expected input distribution of the deeper language model layers. This misalignment is particularly evident in tasks like image captioning, where the specificity of a task – such as writing style and caption length – can only be superficially encoded into visual features through shallow methods.

A common strategy, as seen in PaLI (Chen et al., 2022b) and Qwen-VL (Bai et al., 2023), involves direct training of LLM during the pre-training or supervised fine-tuning (SFT) phase. However, this approach can compromise the models’ generalizability, particularly for tasks focused on textual outputs. Conventionally, LLMs are pretrained on extensive text-only datasets (Raffel et al., 2020), leading to a significant divergence in data distribution when compared to image-text pair datasets like LAION (Schuhmann et al., 2022) and COYO (Byeon et al., 2022). This shift often results in catastrophic forgetting, a phenomenon where the model’s proficiency in its original domain deteriorates. This issue is evident in Figure 4, which shows a marked decline in MMLU (Hendrycks et al., 2020) score as the model becomes more attuned to the LAION dataset, thus validating our hypothesis. This trend is not isolated; similar effects have been observed in models like PaLM-E (Driess et al., 2023) and Flamingo (Alayrac et al., 2022). For instance, adapting an 8B parameter language model for VLM pretraining can lead to an 87.3% reduction in natural language generation (NLG) performance (Driess et al., 2023).

The discussion above raises an important question: is it possible to retain the NLP capabilities of the large language model while adding top-notch visual understanding abilities to it?

CogVLM gives a “yes” answer. CogVLM instead adds a trainable visual expert to the language model. In each layer, the image features in the sequence use a new QKV matrix and MLP layer, separate from those applied to the text features. The visual expert doubles the number of parameters while keeping the FLOPs the same. Since all the parameters in the original language model are fixed, the behaviors are the same as in the original language model if the input sequence contains no image. This inspiration arises from the comparison between P-Tuning (Liu et al., 2023f) and LoRA (Hu et al., 2021) in efficient finetuning, where P-Tuning learns a task prefix embedding in the input while LoRA adapts the model weights in each layer via a low-rank matrix. As a result, LoRA performs better and is more stable. A similar phenomenon might also exist in VLM, because in the shallow alignment methods, the image features act like the prefix embedding in P-Tuning.

Our contributions in this work are as follows:

We introduce the CogVLM model, which deeply integrates visual and linguistic features while retaining the full capabilities of a pretrained large language model. CogVLM-17B, trained from Vicuna-7B, achieves state-of-the-art across 17 classic cross-modal benchmarks.

Through extensive ablation studies, we validated the effectiveness of our proposed visual expert module and the importance of deep fusion. We further delved into multiple critical factors in multimodal pretraining, including the scale of the visual encoder, variants of the attention mask, the most impactful parameters in VLMs, and the necessity of incorporating a self-supervised image loss.

We have made the weights of CogVLM and the dataset used in the SFT phase available to the public. We anticipate that the open sourcing of CogVLM will significantly contribute to the research and industrial application of visual understanding.

Method

The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. Figure 4 shows an overview of the CogVLM architecture. The components’ design and implementation details are provided below:

ViT encoder. We utilize pretrained EVA2-CLIP-E (Sun et al., 2023b) in CogVLM-17B. Note that the final layer of ViT encoder is removed because it specializes in aggregating the [CLS] features for contrastive learning.

MLP adapter. To map the output of ViT into the same space as the text features from word embedding, we use an MLP adapter, a two-layer MLP (SwiGLU (Shazeer, 2020)). For implementation convenience, all image features share the same position id in the language model.

Pretrained large language model. CogVLM’s model design is compatible with any off-the-shelf GPT-style pretrained large language model. Specifically, CogVLM-17B adopts Vicuna1.5-7B (Chiang et al., 2023) for further training. A causal mask is applied to all the attention operations, including the attention between image features.

Visual expert module. We add a visual expert module to each layer to enable deep visual-language feature alignment. Specifically, the visual expert module in each layer consists of a QKV matrix and an MLP. The shapes of the QKV matrix and MLP are identical to those in the pretrained language model, and both are initialized from them. The motivation is that each attention head in the language model captures a certain aspect of semantic information, while a trainable visual expert can transform the image features to align with the different heads, therefore enabling deep fusion.
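The routing described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single attention head, square projection matrices, and a hypothetical dictionary layout (`"q"`, `"k"`, `"v"`) for the weights. The names `expert_attention`, `expert_ffn`, and `is_image` are introduced here for illustration only.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expert_attention(X, is_image, W_I, W_T):
    """Single-head causal attention with per-token expert routing.

    X is a list of token vectors and is_image[i] marks image tokens.
    W_I / W_T hold the "q", "k", "v" matrices of the visual expert and of
    the original language model (hypothetical layout, square matrices).
    """
    dim = len(X[0])

    def project(name):
        # Image tokens use the visual expert's matrix, text tokens the LLM's.
        return [matvec((W_I if img else W_T)[name], x)
                for x, img in zip(X, is_image)]

    Q, K, V = project("q"), project("k"), project("v")
    out = []
    for i in range(len(X)):
        # Causal mask: token i only attends to positions 0..i.
        scores = [sum(a * b for a, b in zip(Q[i], K[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(dim)])
    return out


def expert_ffn(X, is_image, ffn_I, ffn_T):
    """Route each token through the visual-expert FFN or the original FFN."""
    return [ffn_I(x) if img else ffn_T(x) for x, img in zip(X, is_image)]
```

If `W_I` starts as a copy of `W_T`, as the initialization described above suggests, this sketch produces the same output as plain causal attention until the expert weights are trained.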

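In the attention layers, the image and text positions of the hidden states are projected with separate QKV matrices: $W_{I}$ for the visual expert and $W_{T}$ for the original language model. A lower-triangular mask, denoted Tril $(\cdot)$ , is applied to the attention scores. The visual expert in the FFN layers works analogously: image features pass through $\text{FFN}_{I}$ , the FFN of the visual expert, and text features pass through $\text{FFN}_{T}$ , the FFN of the original language model.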
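The computation described here can be written out as follows. This is a reconstruction from the definitions in the surrounding text rather than a verbatim formula. The split of the hidden states $X$ into an image part $X_{I}$ and a text part $X_{T}$ , the per-head dimension $D$ , and the superscripts $Q,K,V$ on the weight matrices are notational assumptions.

$$\text{Attention}(X,W_{I},W_{T})=\text{softmax}\left(\frac{\text{Tril}(QK^{\top})}{\sqrt{D}}\right)V,$$

$$Q=\text{concat}(X_{I}W_{I}^{Q},\,X_{T}W_{T}^{Q}),\quad K=\text{concat}(X_{I}W_{I}^{K},\,X_{T}W_{T}^{K}),\quad V=\text{concat}(X_{I}W_{I}^{V},\,X_{T}W_{T}^{V}),$$

$$\text{FFN}(X)=\text{concat}(\text{FFN}_{I}(X_{I}),\,\text{FFN}_{T}(X_{T})).$$

When the sequence contains no image tokens, $X_{I}$ is empty and both expressions reduce to the attention and FFN of the original language model.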

Position embedding. In the RoPE within the LLM, we allow all visual tokens to share a single position id, as they already encapsulate positional information when input into the ViT. This approach mitigates the impact of remote attenuation between tokens in the LLM. Given that an image can occupy hundreds to thousands of tokens, and a typical input sequence places the image tokens before the query, using conventional positional encoding would result in excessively lengthy encoding sequences. Moreover, it would lead the query to focus more on the image tokens closer to it, namely the lower part of an image.
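A minimal sketch of such a position-id scheme is shown below. It assumes that every contiguous run of image tokens collapses onto one id and that text tokens advance the id by one. The function name `build_position_ids` and the boolean `is_image` mask are hypothetical, and the released code may treat image boundary tokens differently.

```python
def build_position_ids(is_image):
    """Assign position ids so that each run of image tokens shares one id."""
    position_ids = []
    next_id = 0
    prev_was_image = False
    for flag in is_image:
        if flag and prev_was_image:
            # Inside an image run: reuse the id given to the run's first token.
            position_ids.append(next_id - 1)
        else:
            position_ids.append(next_id)
            next_id += 1
        prev_was_image = flag
    return position_ids


# One text token, three image tokens, then a two-token query.
print(build_position_ids([False, True, True, True, False, False]))
# [0, 1, 1, 1, 2, 3]
```

With this scheme the query sits at the same distance from every image token, regardless of how many tokens the image occupies.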

Pretraining

Data. The image-text pairs for pretraining are all publicly available, including LAION-2B and COYO-700M. After removing the broken URLs, NSFW images, images with noisy captions, images with political bias and images with an aspect ratio $>6$ or $<1/6$ , about 1.5B images are left for pretraining.

We also crafted a visual grounding dataset of 40M images. Each noun in the image caption is associated with bounding boxes to indicate the positions in the image. The construction process basically follows (Peng et al., ), which extracts nouns via spaCy (Honnibal & Johnson, 2015) and predicts the bounding boxes using GLIPv2 (Zhang et al., 2022). The image-text pairs are sampled from LAION-115M, a subset of LAION-400M filtered by (Li et al., 2023b). We filter and retain a subset of 40 million images to ensure that over 75% of images contain at least two bounding boxes.

Training. The first stage of pretraining uses an image captioning loss, i.e., next token prediction in the text part. We train the CogVLM-17B model on the 1.5B image-text pairs introduced above for 120,000 iterations with a batch size of 8,192. The second stage of pretraining is a mixture of image captioning and Referring Expression Comprehension (REC). REC is a task to predict the bounding box in the image given the text description of an object, which is trained in the form of VQA, i.e., Question: Where is the object? and Answer: $[[x_{0},y_{0},x_{1},y_{1}]]$ . Both $x$ and $y$ coordinates range from $000$ to $999$ , representing the normalized position in the image. We only consider the loss of the next token prediction in the “Answer” part. We pretrain the second stage for 60,000 iterations with a batch size of 1,024 on the text-image pairs and visual grounding datasets introduced above. During the final 30,000 iterations, we change the input resolution from $224\times 224$ to $490\times 490$ . The total number of trainable parameters is 6.5B.
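The answer format for REC can be illustrated with a short sketch. The text only states that coordinates are normalized to the range $000$ to $999$, so the exact rounding rule used here (scale by 1000, truncate, clamp to 999, zero-pad to three digits) is an assumption, as is the helper name `format_box`.

```python
def format_box(x0, y0, x1, y1, width, height):
    """Render a pixel-space box as a normalized [[x0,y0,x1,y1]] string."""

    def norm(value, size):
        # Scale to 0..999 (assumed rounding rule) and zero-pad to three digits.
        return f"{min(int(value / size * 1000), 999):03d}"

    return (f"[[{norm(x0, width)},{norm(y0, height)},"
            f"{norm(x1, width)},{norm(y1, height)}]]")


# A box covering the second quarter of a 640x480 image in both directions.
print(format_box(160, 120, 320, 240, 640, 480))  # [[250,250,500,500]]
```

Because the coordinates are plain text, the box is predicted token by token with the same next-token loss as any other answer.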

Alignment

In the instruction alignment phase, we trained two generalist models: CogVLM-Chat and CogVLM-Grounding. CogVLM-Chat accepts natural language inputs and outputs, while CogVLM-Grounding accepts inputs and outputs with bounding boxes.

CogVLM-Chat. In our study, we integrated data from a variety of open-source visual question-answering datasets, including VQAv2 (Antol et al., 2015), OKVQA (Marino et al., 2019), TextVQA (Singh et al., 2019), OCRVQA (Mishra et al., 2019), ScienceQA (Lu et al., 2022), as well as datasets formatted as multi-turn dialogues such as LLaVA-Instruct (Liu et al., 2023c), LRV-Instruction (Liu et al., 2023a), LLaVAR (Zhang et al., 2023). We then conducted unified instruction-supervised fine-tuning (SFT) across these diverse datasets. The integrity and quality of SFT data are crucial; notably, the LLaVA-Instruct dataset, initially generated through a language-only GPT-4 pipeline, contained certain inaccuracies. We meticulously corrected these errors through manual inspection and annotation to ensure data quality.

VQA datasets typically feature concise, often one-word answers, contrasting with the dialogue datasets that provide detailed responses with extensive reasoning. To accommodate this variability, we employed prompts formatted as Question: Short answer: for concise responses and Question: Answer: for extended discourse in the SFT phase.
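The two prompt templates can be sketched as a small helper. The placement of the question text between `Question:` and the answer marker, and the single-space separators, are assumptions; the helper name `build_prompt` is hypothetical.

```python
def build_prompt(question, short_answer):
    """Wrap a question in the short-answer or long-answer SFT template."""
    suffix = "Short answer:" if short_answer else "Answer:"
    return f"Question: {question} {suffix}"


print(build_prompt("What color is the car?", short_answer=True))
# Question: What color is the car? Short answer:
print(build_prompt("Why is the man smiling?", short_answer=False))
# Question: Why is the man smiling? Answer:
```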

During training, the model underwent 6,000 iterations with a learning rate of 1e-5 and a batch size of 1,024. To improve the stability of training, we unfroze the visual encoder’s parameters and set its learning rate to one-tenth of that used for the remaining trainable parameters.
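One way to express the two learning rates is with per-group optimizer settings, sketched below in plain Python. The `"vit."` name prefix and the function `build_param_groups` are hypothetical. The resulting list follows the common convention of dictionaries with `params` and `lr` keys, which optimizers such as those in PyTorch accept.

```python
BASE_LR = 1e-5  # learning rate stated for the SFT phase


def build_param_groups(named_params, vit_prefix="vit."):
    """Split parameters into two groups; the visual encoder gets BASE_LR / 10."""
    vit_params, other_params = [], []
    for name, param in named_params:
        # The prefix test is an assumed naming convention for the ViT weights.
        (vit_params if name.startswith(vit_prefix) else other_params).append(param)
    return [
        {"params": other_params, "lr": BASE_LR},
        {"params": vit_params, "lr": BASE_LR / 10},
    ]
```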

CogVLM-Grounding. In order to endow our model with consistent, interactive visual grounding capabilities, we collect a high-quality dataset covering four types of grounding data: (1) Grounded Captioning (GC) - image captioning datasets where each noun phrase within the caption is followed by the corresponding referential bounding boxes; (2) Referring Expression Generation (REG) - image-oriented datasets in which each bounding box in the image is annotated with a descriptive textual expression that accurately characterizes and refers to the content within the specific region; (3) Referring Expression Comprehension (REC) - text-oriented datasets in which each textual description is annotated with multiple referential links associating the phrases with corresponding boxes; (4) Grounded Visual Question Answering (GroundedVQA) - VQA-style datasets where the questions may contain region references in a given image. The sources of grounding data are all publicly available, including Flickr30K Entities (Plummer et al., 2015), RefCOCO (Kazemzadeh et al., 2014; Mao et al., 2016; Yu et al., 2016), Visual7W (Zhu et al., 2016), VisualGenome (Krishna et al., 2017) and Grounded CoT-VQA (Chen et al., 2023a). $[box]$ in this section is in the format of $[[x_{0},y_{0},x_{1},y_{1}]]$ .

It is noteworthy that the curated datasets exhibit a versatility of visual grounding capabilities, and many datasets can be adapted and repurposed across different tasks. For instance, grounded captioning datasets can be reformulated to suit REG and REC tasks. Taking the example of “A man $[box_{1}]$ and a woman $[box_{2}]$ are walking together.”, this can be reframed into question answering pairs like (“Describe this region $[box_{2}]$ .”, “A woman.”) and (“Where is the man?”, “ $[box_{1}]$ ”). Similarly, REC datasets can be translated into REG tasks by switching the input and output, and vice versa. However, certain conversions might lead to ambiguities. For example, when presented with the isolated query “Where is another man?” from the caption “A man $[box_{1}]$ is running, while another man $[box_{2}]$ is looking.”, the distinction between $[box_{1}]$ and $[box_{2}]$ becomes unclear, potentially leading to errors.
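The reformulation of a grounded caption into REG and REC pairs can be sketched as follows. The sketch assumes the caption has already been parsed into (noun phrase, box token) pairs. The helper names and the simple article swap in `to_definite` are illustrative assumptions, not the pipeline used to build the dataset, and the sketch does not handle the ambiguous cases mentioned above.

```python
def to_definite(phrase):
    """Turn 'A man' / 'a woman' into 'the man' / 'the woman'."""
    words = phrase.split()
    if words and words[0].lower() in {"a", "an"}:
        words[0] = "the"
    return " ".join(words)


def grounded_caption_to_qa(entities):
    """Build REG and REC question-answer pairs from (phrase, box) pairs."""
    # REG: the box is the input, the describing phrase is the answer.
    reg = [(f"Describe this region {box}.", phrase[0].upper() + phrase[1:] + ".")
           for phrase, box in entities]
    # REC: the phrase is the input, the box is the answer.
    rec = [(f"Where is {to_definite(phrase)}?", box) for phrase, box in entities]
    return reg, rec


entities = [("A man", "[box_1]"), ("a woman", "[box_2]")]
reg_pairs, rec_pairs = grounded_caption_to_qa(entities)
print(reg_pairs[1])  # ('Describe this region [box_2].', 'A woman.')
print(rec_pairs[0])  # ('Where is the man?', '[box_1]')
```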

Experiments

To rigorously validate the superior performance and robust generalization of our base model, we conduct quantitative evaluations on an array of multi-modal benchmarks. These benchmarks can be categorized into four broad areas covering a comprehensive range of measurements (a detailed summary of all benchmarks and corresponding metrics is available in Appendix A.2):

Image Captioning. The main purpose of these tasks is to generate textual captions summarizing the major content of a given image. We utilize prominent datasets including NoCaps (Agrawal et al., 2019), COCO (Lin et al., 2014), Flickr30K (Plummer et al., 2015), and TextCaps (Sidorov et al., 2020) for evaluation.

Visual Question Answering. The VQA tasks require models to answer questions that may focus on distinct visual contents based on the given image. Our assessment covers diverse datasets, including VQAv2 (Antol et al., 2015), OKVQA (Marino et al., 2019), TextVQA (Singh et al., 2019), OCRVQA (Mishra et al., 2019) and ScienceQA (Lu et al., 2022).

LVLM Benchmarks. LVLM benchmarks are primarily employed to assess the advanced capabilities of large multimodal models, such as object recognition and localization, OCR, visual description, and visual knowledge reasoning. We conduct multidimensional evaluations of the models on datasets including MM-Vet (Yu et al., 2023), MMBench (Liu et al., 2023g), SEED-Bench (Li et al., 2023a), LLaVA-Bench (Liu et al., 2023c), POPE (Li et al., 2023c), MMMU (Yue et al., 2023) and MathVista (Lu et al., 2023).

Visual Grounding. Visual grounding involves a set of tasks that establish referential links between textual mentions in a sentence and specific regions in an image. We evaluate our model on the typical datasets, including Visual7W (Zhu et al., 2016), RefCOCO (Liu et al., 2017), RefCOCO+, and RefCOCOg to ensure completeness.

We evaluate the image captioning capability of our pretrained base model on the aforementioned four benchmarks. In a zero-shot evaluation on the NoCaps and Flickr datasets, we assess the precision of our model in describing long-tail visual concepts. Additionally, we present results from finetuning on the COCO and TextCaps datasets.

The detailed performance is shown in Table 1. Overall, our model achieves SOTA or comparable performance across the board. Specifically, on the NoCaps benchmark, our base model outperforms the previous best method, GIT2, across four splits, with a maximum margin of $5.7$ points on the out-domain set, while consuming only about 10% of the pretraining data (1.5B vs 12.9B). On the Flickr benchmark, our model achieves a SOTA score of $94.9$ , surpassing the concurrently released Qwen-VL model by $9.1$ points. These results demonstrate the remarkable capability and robustness of our pretrained model on the image captioning task. We also evaluate our model on COCO (Lin et al., 2014) and TextCaps, where the latter is specifically designed to integrate the textual information of the given image into captions. Although trained without dedicated OCR data, our base model shows, encouragingly, a significant text-reading ability: it is competitive with PaLI-X-55B and outperforms the previous best model of the same scale, PaLI-17B, by $9.1$ points.

Visual Question Answering

As illustrated in Table 2, our CogVLM model demonstrates outstanding performance and a significant lead over models of similar parameter scale across a variety of tasks, including daily-life image question-answering dataset VQAv2, text-intensive image question-answering datasets such as TextVQA and OCRVQA, and knowledge-demanding datasets like OKVQA and ScienceQA. This success showcases the model’s robust generalization capabilities and potential across diverse domains.

LVLM Benchmarks

Our findings, detailed in Table 2, demonstrate that CogVLM achieved state-of-the-art results on all 7 LVLM benchmarks, markedly surpassing all other models. It also outperformed multimodal models that utilized larger language models, such as LLaVA-1.5 with Vicuna-13B and Emu-2 with LLaMA-33B, leading by 15.7 and 2.6 points on MM-Vet, and by 9.9 and 14.0 points on MMBench, respectively. Compared to IDEFICS-Instruct trained on LLaMA-65B, CogVLM’s scores were higher by 19.3, 23.1, and 20.9 points on SEED-Bench, MMBench, and LLaVA-Bench, respectively. Furthermore, CogVLM achieved a score of 41.1 on the MMMU dataset, 87.9 on the hallucination assessment dataset POPE, and 35.2 on the multimodal mathematical reasoning benchmark MathVista. These results showcase its robust reasoning abilities and multi-task generalization capabilities and demonstrate that CogVLM significantly outpaces other models in these domains. Notably, shallow fusion models such as InstructBLIP and MiniGPT-4 underperformed across most benchmarks, despite InstructBLIP’s extensive training on instructional data, underscoring the necessity of deep fusion for enhanced performance.

Visual Grounding

Table 3 shows the result on the standard visual grounding benchmarks. We find that our generalist model achieves state-of-the-art performance across the board, with a significant advantage over the previous or concurrent models. As shown in the bottom part of Table 3, our model even surpasses models that are specifically trained for individual tasks, achieving SOTA performance on 5 of 9 splits. For instance, in the RefCOCO val subset, our model attains a score of 92.76, surpassing UNINEXT-H’s 92.64; in the RefCOCO+ test-A subset, it scores 92.91, exceeding ONE-PEACE’s 92.21; and in the RefCOCOg test subset, it achieves 90.79, outperforming UNINEXT-H’s 89.27. These results suggest a remarkable visual grounding capability of our model incorporating our training paradigm.

Ablation Study

To understand the impact of various components and settings on our model’s performance, we conduct an extensive ablation study, training each variant for 6,000 iterations with a batch size of 8,192. Table 4 summarizes the results on the following aspects:

Model structure and tuned parameters. To investigate the effectiveness of CogVLM’s design, we conduct ablation studies on several structure variants and tuning strategies, including: 1) tuning only the MLP Adapter layer; 2) tuning all LLM parameters and the Adapter without adding the visual expert; 3) adding the visual expert only at every 4th LLM layer; and 4) adding the visual expert only to the FFNs at all layers.

From the results, we can see that shallow vision-language alignment, i.e. only tuning the adapter layer (similar to the method used in BLIP-2), results in a significantly inferior performance. Also, the performance of training the visual expert is higher than that of training the LLM, especially on the datasets that require external knowledge, even though the training parameters are roughly the same. We also compare with other variants of adding visual expert, including a. inserting an expert module every 4 layers and b. removing the attention part from the expert. Both of them result in a certain degree of performance decline, but within an acceptable range, which provides some guidance for balancing computational overhead and model performance.

Initialization Method. As for visual expert’s initialization method, we compare initialization with weights from LLM to random initialization. Our results across various datasets demonstrate that initialization with LLM’s weights consistently achieves superior performance. This indicates that the transformer architecture pre-trained on language data possesses a certain capability to process visual tokens. Moreover, it can serve as a more effective starting point for multimodal pre-training initialization.

Visual Attention Mask. We empirically find that using a causal mask on visual tokens yields a better result than using a full mask. This is slightly counterintuitive, as a bidirectional attention mask allows access to more information than a causal mask. We hypothesize that the causal mask better fits the inherent structure of LLMs.

Image SSL Loss. We also investigated a self-supervised learning loss on image features, where each visual feature predicts the CLIP feature of the next position for visual self-supervision. In line with the observation from PaLI-X (Chen et al., 2023b), we find it brings no improvement on downstream tasks, although we did observe improvements in small models in our early experiments.

Visual Encoder. We substituted the 300M-parameter EVA2-L model for the 4.4B-parameter EVA2-E to investigate the impact of visual encoder parameters on various tasks. The results indicated that there was only a slight decrease in performance across most benchmarks. However, a notable exception was observed in the text-oriented dataset TextVQA, where we recorded a decline of 2.5 points.

EMA. We utilize EMA (Exponential Moving Average) during pretraining. The ablation results show that EMA often brings improvements across various tasks compared to not using it.
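For reference, the standard EMA update keeps a shadow copy of the weights that moves a small step toward the current weights after each optimizer step. The sketch below uses plain Python lists. The decay value is an assumption, since the text does not state the one used for CogVLM, and the function name `ema_update` is hypothetical.

```python
def ema_update(ema_params, params, decay=0.999):
    """Return shadow weights moved toward the current weights by (1 - decay)."""
    # decay=0.999 is an illustrative default, not a value from the paper.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# With decay=0.5 the shadow weight lands halfway between old and new values.
print(ema_update([1.0], [0.0], decay=0.5))  # [0.5]
```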

Conclusion

In this paper, we introduce CogVLM, an open visual language foundation model. CogVLM shifts the paradigm for VLM training from shallow alignment to deep fusion, achieving state-of-the-art performance on 17 classic multi-modal benchmarks.

VLM training is still in its infancy, and there are many directions to explore, for example, better SFT alignment, RLHF and anti-hallucination. Since prominent earlier VLMs are mostly closed-source, we believe CogVLM will be a solid foundation for future multi-modal research.

References

Appendix A Appendix

We report the details of parameter settings during pre-training and multitask training in Table 5 and Table 6.

A.2 Details of Associated Datasets

In this section, we introduce the details of datasets and their use in our evaluation process for all associated benchmarks.

COCO (Lin et al., 2014) The captions in the COCO dataset are collected from Amazon Mechanical Turk (AMT) workers, who are given instructions to control the quality. The dataset contains 330K images, where the train, validation and test sets contain 413,915 captions for 82,783 images, 202,520 captions for 40,504 images, and 379,249 captions for 40,775 images, respectively.

NoCaps (Agrawal et al., 2019). NoCaps is a large-scale benchmark for novel object captioning, containing nearly 400 novel object classes compared to COCO. The validation and test sets comprise 4,500 and 10,600 images, respectively, sourced from Open Images (Krasin et al., 2017) and annotated with 11 human-generated captions per image. Each set is subdivided into three domains: “in”, “near”, and “out”, with objects in the “out-domain” never appearing in the COCO dataset.

Flickr30K (Plummer et al., 2015). Flickr30K is a high-quality dataset consisting of 31,783 images of everyday life activities, events and scenes (all harvested from the online website Flickr) and 158,915 captions (obtained via crowdsourcing). Each image in this dataset is described independently by five annotators who are not familiar with the specific entities and circumstances depicted in it.

TextCaps (Sidorov et al., 2020) TextCaps is a dataset with 145k captions for 28k images. The TextCaps dataset is designed to integrate textual information with visual context into captions, requiring the model to have both excellent OCR capabilities and strong captioning abilities.

A.2.2 General VQA

VQAv2 (Antol et al., 2015) VQAv2 encompasses over 200,000 images, paired with more than 1.1 million questions that have collectively garnered over 11 million answers. Questions span various types, including yes/no, counting, and open-ended queries.

OKVQA (Marino et al., 2019) The OK-VQA (Outside Knowledge Visual Question Answering) dataset is specifically designed to probe visual question answering capabilities that necessitate external knowledge or common sense beyond image content. It has 14,055 open-ended questions and 5 ground truth answers per question.

ScienceQA (Lu et al., 2022) The ScienceQA dataset comprises 21,208 multimodal multiple-choice questions spanning three diverse subjects: natural science, language science, and social science. Each question is annotated with explanations linked to relevant lectures.

TDIUC (Shrestha et al., 2019) The TDIUC dataset features 1.6M questions across 170K images from MS COCO and Visual Genome. Categorized into 12 distinct question types, it ranges from basic tasks like identifying objects or colors to more advanced reasoning like counting or positional discernment.

A.2.3 Text-oriented VQA

OCRVQA (Mishra et al., 2019) OCR-VQA consists of 207,572 book cover images with over 1 million question-answer pairs.

TextVQA (Singh et al., 2019) TextVQA is a dataset with 45,336 questions on 28,408 images that challenges models to detect, read, and reason about text within images to provide answers.

A.3 LVLM Benchmarks

MM-Vet (Yu et al., 2023) MM-Vet defines six core VL capabilities and examines 16 integrations of interest derived from the combinations of these capabilities. It employs an evaluator based on LLMs for open-ended outputs, capable of assessing across different question types and answer styles, thus deriving a unified scoring metric.

SEED-Bench (Li et al., 2023a) SEED-Bench is a dataset comprising 19K multiple-choice questions with precise human annotations, covering 12 evaluation dimensions, including understanding of image and video modalities. It obtains accurate answer options through manual annotations, enabling objective and efficient assessment of model performance.

MMBench (Liu et al., 2023g) MMBench comprises approximately 3000 multiple-choice questions, covering 20 different capability dimensions, aimed at evaluating various abilities of visual-language models. MMBench adopts a hierarchical capability dimension structure, including two high-level capability dimensions: perception and reasoning, as well as fine-grained capability dimensions such as object localization and attribute inference.

LLaVA-Bench (Liu et al., 2023c) LLaVA-Bench (In-the-Wild) is a benchmark dataset comprising 60 questions, designed to evaluate the multimodal instruction following capabilities of LMMs. It includes indoor and outdoor scenes, memes, paintings, sketches, etc., and is equipped with highly detailed, manually curated descriptions and appropriate question selections.

POPE (Li et al., 2023c) The POPE dataset is a binary classification query dataset specifically designed to evaluate object hallucination issues in LMMs. The random, popular, and adversarial subsets within the POPE dataset are constructed through different sampling strategies, totaling 8,910 entries.

MMMU (Yue et al., 2023) The MMMU dataset is a large-scale, multidisciplinary multimodal understanding and reasoning benchmark set, containing 11.5K questions. It covers 6 major disciplines, 30 topics, and 183 subfields, with question types including multiple-choice and open-ended questions. The dataset includes 30 types of images, such as charts, tables, chemical structures, photographs, paintings, musical scores, etc., testing the multimodal perception capabilities of models and their performance in expert-level tasks.

MathVista (Lu et al., 2023) MathVista is a new benchmark dataset that combines mathematical and visual understanding, comprising 31 existing multimodal datasets and 3 newly created datasets, totaling 6141 examples. These datasets encompass a diverse range of mathematical reasoning abilities, including seven types: algebra, arithmetic, geometry, logic, numerical common sense, science, and statistics. The goal is to comprehensively evaluate the capabilities of existing foundational models in mathematical reasoning and visual understanding.

RefCOCO/RefCOCO+ (Liu et al., 2017) RefCOCO and RefCOCO+ evolved from the ReferItGame. Both subsets focus on images with two or more similar objects. RefCOCO, with 142,209 expressions across 19,994 images, places no linguistic constraints. Conversely, RefCOCO+ emphasizes appearance-centric descriptions, omitting locational terms, and comprises 141,564 expressions over 19,992 images.

RefCOCOg (Mao et al., 2016) The RefCOCOg subset was amassed through Amazon Mechanical Turk, where workers penned natural referring expressions for objects in MSCOCO images; it boasts 85,474 referring expressions spanning 26,711 images, each containing 2 to 4 objects of the same category.

Visual7W (Zhu et al., 2016). The Visual7W dataset is predominantly designed for VQA tasks, with a dedicated subset crafted for grounded VQA. In this subset, models are presented with an image accompanied by a “which”-type question, such as “Which is the small computer in the corner?”. Participants are then given four bounding boxes within the image, from which they must select the correct one as the answer. The grounded Visual7W part consists of 25,733 images and 188,068 questions.

Flickr30K-Entities (Plummer et al., 2015). The Flickr30K Entities dataset, a precursor in the realm of grounded captioning, encompasses a collection of 31,783 images accompanied by 158k captioning annotations. Every caption in this dataset has been meticulously annotated such that each noun phrase is linked with a manually delineated referential bounding box. In total, there are 276k such annotated bounding boxes provided within this dataset.

VisualGenome (Krishna et al., 2017). The VisualGenome dataset stands as a cornerstone in understanding the multifaceted relationships present within images. With a collection of over 100k images, each image is annotated in detail, capturing an average of 21 objects, 18 attributes, and 18 inter-object relationships. A unique aspect of this dataset is the alignment of objects, attributes, relationships, and region descriptions with standardized terminologies from WordNet. Specifically tailored for the REG and REC tasks, each annotated region in an image comes with a corresponding descriptive text, making it a rich resource for image understanding and semantic modeling. We use the subset with around 86k images and 3.6 million region-caption pairs for visual grounding.

Appendix B Additional Fine-grained Experiments

To comprehensively investigate the proposed model on specific topics and question types, we further conduct extensive experiments on a representative benchmark, TDIUC (Kafle & Kanan, 2017). We use the publicly available split of val set as evaluation data, and the VQA accuracy calculated from their official scripts as the evaluation metric.

The experimental results on TDIUC, comparing our model against the specialist SOTA method MUREL (Cadene et al., 2019), are shown in Figure 5. We can see that our model consistently outperforms the previous model on 12 specific question types, resulting in a $94.0$ accuracy score compared to the previous SOTA of $88.2$ on the overall dataset. These results demonstrate that our model exhibits comprehensive problem-solving skills on general VQA tasks.

Appendix C Computational Efficiency

In this section, we compare the computational efficiency of our model with other state-of-the-art models, considering both pretraining and finetuning data from datasets such as VQAv2 and TextVQA. Owing to an optimized architecture and the utilization of high-quality pretraining data, our model demonstrates a marked reduction in resource consumption during training relative to models with comparable parameter magnitudes.