ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin

Introduction

Recent breakthroughs in artificial intelligence have been driven notably by the development of large language models (LLMs). Following this evolution, modality unification via LLMs has become an inevitable trend, and visually aligned multi-modal LLMs have advanced rapidly. Setting aside differences in model architecture and training data, most large multi-modal models (LMMs) adhere to a dual-phase paradigm: a pre-training stage with large-scale image-text pairs for modality alignment, followed by a supervised fine-tuning (SFT) stage that enhances multi-modal capabilities through instruction-format data.

Despite their efforts and achievements, we argue that the current LMMs still align the modalities in a sub-optimal manner, primarily due to the lack of sufficient high-quality image-text pairs. Vision, inherently rich in information and fine-grained semantics, is often reduced to simplistic captions in mainstream image-text datasets. These captions, typically brief and focused on salient objects, lead to a significant reduction in information content and sub-optimal modality alignment.

To test our argument, we conducted a straightforward experiment: we substituted the image-text pairs used in the SFT stage of several typical LMMs with an equal number of comprehensive captions generated by the advanced GPT4-Vision model and re-benchmarked these LMMs. As shown in Figure 2, this like-for-like substitution, despite its small extent (only 3.5% of the SFT data in the LLaVA-1.5 case), resulted in consistent performance gains across various LMMs and benchmarks. Encouraged by these promising results, we expanded our efforts to collect high-quality captions on a larger scale in two phases. In the first phase, we gathered approximately 100K images from various data sources and employed carefully designed data-specific prompts to have GPT4-Vision generate high-quality descriptions. The resulting captions, averaging 942 characters, encompass a comprehensive range of image information, such as world knowledge, object properties, spatial relations, and aesthetic evaluation. In the second phase, we used these captions to build a strong caption model, which dispenses with the data-source-specific prompts and can generate comprehensive captions for given images.

Based on the above endeavors, we introduce the ShareGPT4V dataset, the first highly descriptive image-text collection. It comprises two components: 100K GPT4-Vision generated captions with diverse image sources and 1.2M captions crafted by our caption model, which is learned from the 100K high-quality captions. With the aid of this dataset, we have developed an eponymous state-of-the-art large multi-modal model, the ShareGPT4V-7B. To maintain clarity in our discourse, ‘dataset’ or ‘model’ will be distinctly specified when referring to ShareGPT4V. Figure 1(b) shows that ShareGPT4V-7B outperforms other advanced 7B-scale LMMs in all 11 benchmarks, showcasing its competitive performance. For instance, our ShareGPT4V-7B model achieves an impressive total score of 1943.8 on the MME benchmark, surpassing the second-ranked Qwen-VL-Chat-7B model, which was trained on 1.4 billion samples, by 95.6 points.

In a nutshell, our contributions are threefold:

We point out the fact that existing low-quality captions can impede the alignment between vision and language modalities of LMMs and we verify it with experimental results. This revelation highlights an urgent requirement within the LMM community for high-quality captions to effectively alleviate such a dilemma.

We introduce the ShareGPT4V dataset, a large-scale image-text collection featuring 100K highly descriptive captions generated by GPT4-Vision and 1.2M high-quality captions generated by our caption model. The captions cover world knowledge, object attributes, spatial relations, aesthetic assessment, etc. Moreover, the general caption model trained on the entire set of GPT4-Vision-generated captions can further scale our dataset and will also be available for community use.

Leveraging the proposed dataset, we have developed ShareGPT4V-7B, an advanced large multi-modal model. Despite lacking an elaborate architecture design, this model consistently demonstrates impressive performance across various multi-modal benchmarks.

Related Work

Large Language Models. In recent years, with the surge in data and computational power, the development of large language models has experienced a boom. Early encoder-only and encoder-decoder models such as BERT and T5, and decoder-centric models such as GPT, leveraged the Transformer architecture to excel in various NLP tasks. The success of GPT3 has popularized the use of decoder-only architectures, which rely on auto-regressive decoding for generating predictions. Subsequent models like PaLM extended the limits of model parameters and dataset scale, while others like InstructGPT and ChatGPT introduced fine-tuning and reinforcement learning techniques for improved conversational interaction. These developments, along with contributions from the open-source community, have set new benchmarks and opened avenues for future research in the NLP area.

Large Multi-modal Models. As LLMs rapidly evolve, part of the research community is increasingly concentrating on introducing visual knowledge into LLMs. Central to this area are the seminal works on modality alignment in vision-language learning. A notable instance is CLIP, which aligns visual and textual modalities through contrastive learning on extensive image-text pairs. A series of works improved upon CLIP by employing refined data strategies for more diverse data; they have been effective for basic visual tasks but less so for complex tasks like visual question answering. MiniGPT-4, leveraging an LLM and a visual encoder, has shown proficiency in image-text dialogues through pre-training alignment and instruction fine-tuning. Subsequent research has further enhanced LMMs by focusing on the quality and diversity of pre-training and fine-tuning data. For instance, LLaVA and InstructBLIP, with improved instruction fine-tuning, have advanced the understanding of complex prompts. mPLUG-Owl, Shikra, and KOSMOS-2 have introduced new data types and training techniques, like grounding data, to reduce hallucinations and improve LMMs' grounding capability. Regrettably, current LMMs appear to have overlooked a crucial element: the quality of captions in image-text pairs.

Image-text Data Enhancement. In the vision-language learning area, several initiatives have been undertaken to enhance the quality of captions within image-text pairs. LaCLIP leverages LLMs to rewrite raw captions, but its effectiveness is often hindered by hallucinations due to limited visual information and the low quality of the original captions. Other research explores methods to filter and blend raw and synthetic captions to enhance the CLIP model. A recent work, VeCLIP, proposes using LLMs to amalgamate information from both raw and synthetic captions. Nevertheless, the approach is constrained by the low quality of synthetic captions, resulting in only minimal incorporation of visual knowledge in the caption fusion process. To the best of our knowledge, in the LMM area, only LLaVA feeds human-annotated short captions and bounding boxes into the GPT4 language model. This approach lets the model 'imagine' viewing the image before producing detailed captions. However, this method relies heavily on extensive human-annotated data and does not allow the model to truly 'see' the images. Consequently, it tends to generate detailed descriptions primarily of main objects, often including those that sit in obscure corners but are annotated with bounding boxes, leading to potential hallucinations in the LMMs' output. In contrast, we employ the most advanced LMM, GPT4-Vision, which is capable of directly producing highly descriptive captions from carefully designed prompts and the corresponding image inputs.

ShareGPT4V Dataset

In this section, we describe in detail how the ShareGPT4V dataset was created. Subsection 3.2 elaborates on how we used GPT4-Vision to generate 100K high-quality captions from various image sources and briefly validates their significant role in the SFT phase of LMMs. Subsection 3.3 describes our methodology for expanding the 100K high-quality captions of Sec. 3.2 to 1.2M captions that match the quality of those generated by GPT4-Vision at an acceptable cost. Table 1 presents a comparison between our dataset and existing widely-used caption datasets in the LMM field. Our ShareGPT4V dataset stands out due to its more diverse range of image sources, the use of a more advanced caption producer, a larger number of samples, and the generation of longer captions.

3.2 ShareGPT4V Data Collection

The supervised fine-tuning captions were collected from GPT4-Vision, the latest and most advanced LMM. For each image selected from a specific data source $D$, we employed a meticulously crafted, data-specific prompt $P_{D}$. This prompt instructed GPT4-Vision to generate detailed descriptions, taking into account factors such as world knowledge, object attributes, spatial relationships, and aesthetic evaluations.

Data sources. To maximize the diversity and comprehensiveness of our data, we compiled around 100K images from various data sources, including images for detection and segmentation, complex text-containing images, and various web images containing artworks, landmarks, celebrities, etc. More details can be found in the supplementary material.

Prompt Design. Given the diversity of our image sources, we expect a highly content-related description for each image. That is, the captions should extend beyond mere appearance and attributes and incorporate knowledge-related information. For instance, the Eiffel Tower should not be described simply as a tall iron tower, and a picture of Einstein should not be summed up as an old man.

To ensure description quality and stability, we designed a base prompt for a general description and added a specialized prompt for each data source. The base prompt asks GPT4-Vision to describe the basic information of the image, including object attributes, appearance, and spatial relationships. The specialized prompt focuses on data-related information. As shown in Figure 3, we emphasize that GPT4-Vision should mention the corresponding knowledge, such as the name and geographical location for a landmark-related image. Additionally, we add an aesthetic-related prompt for part of the images to further improve the comprehensiveness of the description.
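The prompt composition described above can be sketched as follows. This is a minimal illustration, not the prompts actually used: the prompt strings, the source names in `SPECIALIZED_PROMPTS`, and the function `build_prompt` are all hypothetical placeholders.

```python
# Hypothetical prompt texts for illustration only; the real prompts are
# those shown in the paper's figures, not these strings.
BASE_PROMPT = (
    "Describe the image in detail, covering object attributes, "
    "appearance, and spatial relationships."
)

# Assumed mapping from data source D to its specialized instruction.
SPECIALIZED_PROMPTS = {
    "landmark": "Mention the landmark's name and geographical location.",
    "celebrity": "Mention the name of the person shown, if recognizable.",
    "artwork": "Mention the title, artist, and style of the artwork if known.",
}

AESTHETIC_PROMPT = "Comment on the aesthetic qualities of the image."


def build_prompt(source: str, with_aesthetics: bool = False) -> str:
    """Compose the prompt P_D for an image drawn from data source D."""
    parts = [BASE_PROMPT]
    specialized = SPECIALIZED_PROMPTS.get(source)
    if specialized is not None:
        parts.append(specialized)
    if with_aesthetics:
        # The aesthetic instruction is only added for part of the images.
        parts.append(AESTHETIC_PROMPT)
    return " ".join(parts)
```

A source without a specialized entry simply falls back to the base prompt, which keeps the general description stable across sources.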

Quality Verification. We conducted a straightforward experiment to verify the quality of the collected data: we chose a range of advanced, publicly available LMMs, including LLaVA-7B, LLaVA-1.5-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B. For a fair comparison, we replaced a corresponding portion of detailed captions in their SFT datasets with a selection from our 100K GPT4-Vision-generated captions, while keeping image data sources as consistent as possible. As depicted in Figure 2, the integration of our highly descriptive captions significantly improved the SFT phase performance across these varied LMMs, reinforcing our pursuit to gather more high-quality captions for potential benefits in the pre-training stage.

3.3 ShareGPT4V-PT Data Generation

Compared with the supervised fine-tuning stage, modality alignment in the pre-training phase is more crucial and demands a large-scale dataset. To build a pre-training dataset, we used the 100K high-quality captions generated by GPT4-Vision to fine-tune an alternative caption model, which we named Share-Captioner. Thanks to its training on diverse and comprehensive data, Share-Captioner is capable of generating highly content-related descriptions with a unified instruction. This approach allows the data scaling phase to proceed without the need for specialized prompt design.

To amass a substantial volume of high-quality image-text pairs, we selected a subset of 1.2 million images from current public datasets (see supplementary material for more details) and employed our pre-trained Share-Captioner for the captioning process. The entire caption generation process required around 44 A100 GPU days, and we name this part of the data ShareGPT4V-PT.
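As a rough worked calculation, the stated cost of about 44 A100 GPU days for 1.2 million images implies an average per-caption cost of a few GPU seconds. The sketch below only restates those two figures; the variable names are ours, and the result is an average that says nothing about batching or parallelism.

```python
# Figures taken from the text above; everything else is simple arithmetic.
GPU_DAYS = 44
NUM_IMAGES = 1_200_000
SECONDS_PER_DAY = 24 * 60 * 60

# Average A100 time spent per generated caption.
gpu_seconds_per_caption = GPU_DAYS * SECONDS_PER_DAY / NUM_IMAGES

# Equivalent throughput of a single GPU.
captions_per_gpu_hour = 3600 / gpu_seconds_per_caption

print(f"{gpu_seconds_per_caption:.3f} GPU-seconds per caption")  # 3.168
print(f"{captions_per_gpu_hour:.0f} captions per GPU-hour")      # 1136
```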

Qualitative Analysis. Figure 4 presents caption results from human-made COCO-Captions, BLIP, LLaVA-1.5-7B, Share-Captioner, and GPT4-Vision. It is important to note that the images featured in this figure were not part of the training dataset for Share-Captioner. The results in Figure 4 demonstrate that Share-Captioner produces captions closely comparable to those generated by GPT4-Vision, in line with our expectations for the captioning process.

Quantitative Analysis. As detailed in Table 2, we generated 100 captions each with GPT4-Vision and our Share-Captioner and invited 10 volunteers to select the better one. As anticipated, our Share-Captioner performs on par with GPT4-Vision, confirming the quality of the ShareGPT4V dataset.

ShareGPT4V-7B Model

To ascertain the efficacy of the ShareGPT4V dataset, we conducted experiments within a fair and controlled setting. This led to the development of ShareGPT4V-7B, a streamlined yet superior baseline LMM leveraging the high-quality data from the ShareGPT4V dataset in both the pre-training and SFT stages.

The ShareGPT4V-7B model follows the design of LLaVA-1.5 and includes three integral components: (1) A vision encoder using the CLIP-Large model, with a resolution of $336 \times 336$ and a patch size of 14, converting input images into 576 tokens. (2) A projector, a two-layer multi-layer perceptron (MLP), which connects the vision and language modalities. (3) An LLM based on the open-source Vicuna-v1.5, derived from LLaMA2. Currently, our focus is on the lightweight 7B model scale, and we have empirically validated that even with lightweight training data and model scale, it can significantly outperform many current LMMs that use extensive training datasets or larger model scales.
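The token count quoted above follows directly from the resolution and patch size: a 336-pixel side divided into 14-pixel patches gives 24 patches per side, hence 24 × 24 = 576 tokens. The sketch below checks this and, as a side calculation, counts the parameters of a two-layer MLP projector. The widths used for the projector (1024 for the vision features, 4096 for the LLM) are assumptions based on the usual CLIP-Large and 7B LLaMA-family sizes and are not stated in the text; the function names are ours.

```python
def num_visual_tokens(resolution: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


def mlp_projector_params(d_vision: int, d_hidden: int, d_llm: int) -> int:
    """Weights plus biases of a two-layer MLP: d_vision -> d_hidden -> d_llm."""
    first_layer = d_vision * d_hidden + d_hidden
    second_layer = d_hidden * d_llm + d_llm
    return first_layer + second_layer


print(num_visual_tokens(336, 14))              # 576
# Assumed widths: 1024 (vision features) and 4096 (LLM hidden size).
print(mlp_projector_params(1024, 4096, 4096))  # 20979712
```

Under these assumed widths the projector has roughly 21M parameters, which is tiny next to the 7B-parameter LLM it feeds.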

4.2 Pre-Training

In the pre-training stage, we use the pre-training subset of the ShareGPT4V dataset, i.e., ShareGPT4V-PT. Given these high-quality captions, fine-tuning the MLP alone does not suffice to exploit their full potential. In previous LMM research, the vision encoder is generally not fine-tuned during pre-training, a rational approach considering the lower quality of previously used captions, where fine-tuning the vision encoder might degrade its visual knowledge extraction ability. Instead, we opted for simultaneous fine-tuning of the vision encoder, projector, and large language model. With this configuration, the large language model acquires a native understanding of visual embeddings, while the vision encoder is also prompted to create relevant visual embeddings for elements in captions. This setup enables a comprehensive exploration and understanding of the knowledge embedded in visual embeddings, aligned with the intricate details of the captions. Specifically, we applied a uniform learning rate of $2 \times 10^{-5}$ across all components, with a batch size of $256$, and the full optimization process spanned roughly $4700$ steps. Notably, we found experimentally that fine-tuning only the latter half of the vision encoder's layers achieves the best results, together with a satisfactory level of training efficiency.
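Two details of this setup lend themselves to a quick check. First, with a batch size of 256, a single pass over about 1.2M captions takes about 4,688 steps, close to the roughly 4,700 steps reported; reading this as one epoch is an inference from the stated numbers, not a claim made in the text. Second, fine-tuning the latter half of the vision encoder amounts to selecting the upper block indices. The sketch below illustrates both; the helper names are ours, and the depth of 24 blocks is an assumption based on the standard CLIP ViT-L/14 configuration.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps needed for one full pass over the data."""
    return math.ceil(num_samples / batch_size)


def trainable_blocks(num_blocks: int) -> list[int]:
    """Indices of the latter half of the transformer blocks (0-based)."""
    return list(range(num_blocks // 2, num_blocks))


print(steps_per_epoch(1_200_000, 256))  # 4688
# Assumed depth: 24 blocks, as in the standard CLIP ViT-L/14.
print(trainable_blocks(24))             # [12, 13, ..., 23]
```

In a training framework, the returned indices would be the blocks whose parameters keep gradients enabled, with the earlier blocks frozen.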

4.3 Supervised Fine-Tuning

As we emphasized above, the goal of this paper is not to build a new SOTA model with unique architecture designs but to investigate how effectively high-quality captions improve the modality alignment of LMMs. We therefore use the 665K supervised data organized by LLaVA-1.5 and replace only part of it with our ShareGPT4V dataset. In detail, the 665K data is gathered from publicly available academic task-oriented data and instruction-tuning data for conversational and complex reasoning tasks involving natural images. It contains 23K detailed description samples, which we replaced with 23K high-quality captions randomly sampled from the 100K captions in ShareGPT4V. During the SFT stage, to enhance training efficiency and ensure a fair comparison, we froze the vision encoder and fine-tuned only the projector and the large language model. The learning rate was set to $2 \times 10^{-5}$, with a batch size of $128$, and the total optimization process spanned around $5200$ steps.
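The substitution described above can be sketched as a simple like-for-like swap. The record layout (a `kind` field marking detailed-description samples) and the function name are hypothetical, chosen only for illustration; the actual data format is not specified here. The accompanying arithmetic uses the figures from the text: 23K of 665K samples is about 3.5% of the SFT data, and 665K samples at batch size 128 take about 5,196 steps, consistent with the roughly 5,200 steps reported if training runs for a single pass (an inference, not a stated fact).

```python
import math
import random


def swap_detailed_captions(sft_data, caption_pool, seed=0):
    """Replace each detailed-description record with a randomly sampled
    record from caption_pool, keeping all other records untouched."""
    rng = random.Random(seed)
    n_replace = sum(1 for r in sft_data if r["kind"] == "detailed_description")
    # Sample without replacement so no new caption is used twice.
    replacements = iter(rng.sample(caption_pool, n_replace))
    return [
        next(replacements) if r["kind"] == "detailed_description" else r
        for r in sft_data
    ]


replaced_share = 23_000 / 665_000            # about 0.0346, i.e. ~3.5%
sft_steps = math.ceil(665_000 / 128)         # 5196
print(f"{replaced_share:.1%}", sft_steps)    # 3.5% 5196
```

Because the swap preserves the total number of samples, any change in downstream scores can be attributed to caption quality rather than data volume.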

Experiments

To thoroughly assess our proposed ShareGPT4V-7B model, we evaluate it across 11 benchmarks, covering a range of academic Visual Question Answering (VQA) tasks and recent benchmarks designed specifically for large multi-modal models (LMMs). The LLaVA (in the wild) benchmark is composed of 60 questions, spanning three distinct tasks: conversation, complex reasoning, and detailed description. The MME Benchmark evaluates LMMs' perception and cognition capabilities through a series of carefully crafted questions across 14 sub-tasks. The MMBench and MMBench-CN benchmarks use manually designed questions to evaluate the model's vision-related reasoning and perception abilities for English and Chinese, respectively. SEED, with the assistance of GPT4, generated a dataset comprising approximately 19K questions related to images and videos. MM-Vet uses GPT4 for a six-dimensional LMM capability assessment. Q-Bench assesses low-level perception, while VQA-v2 and VizWiz are benchmarks in the realm of traditional VQA tasks.

5.2 Quantitative Comparison

We present a quantitative comparison between our proposed ShareGPT4V-7B model and existing state-of-the-art LMMs. Notably, compared with previous LMMs, our ShareGPT4V-7B attained the best performance on 9 of the 11 benchmarks.

Specifically, our ShareGPT4V-7B model outperformed the previously best-performing LLaVA-1.5-13B model by 1.9 points on the LLaVA (in the wild) benchmark, demonstrating superior capabilities in tasks such as detailed description and complex reasoning. On the MME Benchmark, it achieved the highest scores in both perception (P) and cognition (C) capabilities, surpassing LLaVA-1.5-13B in perception by 36.1 points and exceeding Qwen-VL-Chat, which was trained on 1.4 billion samples, by 15.7 points in cognition. Our model also achieved the best accuracy of 68.8% on MMBench, leading the second-best by 1.1%. Furthermore, on the SEED (image) benchmark, which includes 9 assessment dimensions and 14K questions, ShareGPT4V-7B achieved the highest score of 69.7%, 1.5% higher than the second-ranked LLaVA-1.5-13B. On the low-level image assessment benchmark Q-Bench, our model's top score of 63.4% can be attributed to the diversity of our constructed dataset. Lastly, our model almost consistently performed best on traditional VQA benchmarks with the smallest model size.

Our findings demonstrate to the community that even with a simple architecture, public data, and fewer parameters (7B), it is possible to outperform many competitors with massive training data and parameter sizes, thanks to the support of these high-quality captions.

5.3 Multi-modal Dialogue

In Figure 5, we present two representative examples within multi-modal dialogue scenarios. The figure demonstrates that our ShareGPT4V-7B exhibits satisfactory capabilities in understanding image details and performing aesthetic assessments. This further corroborates the significance of the high-quality captions we have collected.

5.4 Ablations

Effectiveness of ShareGPT4V Dataset. As shown in Table 4, we conducted a thorough ablation study to assess the impact of the ShareGPT4V-PT and ShareGPT4V subsets. Our baseline is the LLaVA-1.5-7B model, which does not use the ShareGPT4V dataset in either the pre-training or SFT stage. Using only our ShareGPT4V subset during the SFT stage resulted in a significant increase of 31.4 points in MME perception score, and improvements of 2.5% and 0.5% in accuracy on the MMBench and SEED benchmarks, respectively. Notably, the ShareGPT4V subset used here was selected from various data sources, yielding larger performance gains than captions drawn solely from the COCO dataset (see Figure 2). When only the ShareGPT4V-PT subset was used during pre-training, we observed a remarkable gain of 46.5 points in MME perception, along with substantial accuracy improvements of 3.1% and 2.3% on the MMBench and SEED benchmarks, respectively. Moreover, employing the ShareGPT4V dataset in both the pre-training and SFT phases led to further improvements in overall performance, validating the necessity of incorporating high-quality captions in both training stages.

Pre-training Caption Quality. We then study how caption quality influences pre-training performance. For a fair comparison, we pre-train the model with the same settings and images, but with captions generated by different models. In detail, we use the 558K LAION-CC-SBU image-text pairs captioned by BLIP as the baseline and replace the text with the high-quality captions in our ShareGPT4V-PT.

As shown in Table 5, compared with the baseline, the joint training strategy with the BLIP-558K data yields better results on all the benchmarks, but the gains are minor: only 4.7 on MME Perception and 0.1 on SEED Bench. When we replace the captions with our ShareGPT4V-PT-558K, the model improves significantly. In detail, it reaches 1549.8, 68.3, and 68.9 on the three benchmarks, surpassing the BLIP-558K case by 18.2, 1.9, and 2.0, respectively. This demonstrates that high-quality captions are essential for effective pre-training and modality alignment.

Number of Captions in Pre-training. In Figure 6, we present our investigation into the quantity of high-quality captions required for the pre-training stage. Here we randomly sample data from ShareGPT4V-PT and train the model on subsets ranging from 100K to 1200K samples. The results show that with only 100K high-quality captions, the model improves significantly on both benchmarks, which further demonstrates the effectiveness of high-quality data. Meanwhile, as the training data scales up, model performance tends to saturate once more than 1000K samples are used for pre-training. This may indicate that with high-quality captions, modality alignment can be quite efficient and achieved with a relatively lightweight data scale.
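One simple way to draw subsets of increasing size for such a scaling study is to shuffle the data once and take prefixes, so that each smaller subset is contained in every larger one. This nesting is our assumption for illustration; the text only states that the subsets are randomly sampled. The function name and the toy sizes in the usage lines are likewise hypothetical.

```python
import random


def nested_subsets(items, sizes, seed=0):
    """Return {size: subset} where each subset is a prefix of one fixed
    random shuffle, so smaller subsets are contained in larger ones."""
    shuffled = list(items)
    if max(sizes) > len(shuffled):
        raise ValueError("requested subset is larger than the dataset")
    random.Random(seed).shuffle(shuffled)
    return {size: shuffled[:size] for size in sorted(sizes)}


# Toy usage: 12 items standing in for 1200K captions, in units of 100K.
subsets = nested_subsets(range(12), sizes=[1, 4, 8, 12])
print({size: len(subset) for size, subset in subsets.items()})
```

Nesting the subsets means that differences between runs reflect added data rather than a different random draw, which makes the saturation trend easier to read.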

Number of Learnable ViT Blocks in Pre-training. As detailed in Table 6, we extensively investigated the optimal approach for fine-tuning the vision encoder during the pre-training phase. Compared to freezing the vision encoder during pre-training, we found that unlocking the latter half of its transformer blocks significantly enhances performance. Specifically, this approach led to a gain of 52.2 points on the MME perception benchmark and substantial accuracy improvements of 2.2% and 1.6% on the MMBench and SEED benchmarks, respectively. This suggests that for high-quality captions, unlocking the vision encoder facilitates more effective modality alignment.

Conclusion

In this study, we introduce ShareGPT4V, a groundbreaking large-scale image-text dataset with 1.2 million detailed and informative captions that surpass existing datasets in terms of richness and diversity, covering world knowledge, object attributes, spatial relationships, and aesthetic assessments. ShareGPT4V comprises 100K high-quality captions from GPT4-Vision for Supervised Fine-Tuning (SFT), expanded to 1.2 million for pre-training through a general caption model. We validated ShareGPT4V’s effectiveness through SFT results on recent LMMs and further demonstrated its capabilities with the superior performance of our ShareGPT4V-7B model, which incorporates the dataset in both pre-training and SFT stages. We are committed to making ShareGPT4V fully accessible to the public, with the aspiration that it becomes a foundational resource in advancing the field of LMMs.

Appendix A Data Sources

Data Source Composition for ShareGPT4V. To maximize the comprehensiveness of our captions, we compiled a total of about 100K images from diverse sources. This includes 50K images from COCO, 30K images from 'LCS' (which abbreviates LAION, CC-3M, and SBU), 20K images from SAM, 500 images from TextCaps, 500 images from WikiArt, and 1K images from web-crawled data (split evenly between images of landmarks and images of celebrities).

Data Source Composition for ShareGPT4V-PT. We used our pre-trained Share-Captioner to generate the pre-training dataset. This dataset comprises a subset of 1.2M images selected from existing public datasets: 118K images from COCO, 570K images from SAM, and 558K images from the LLaVA-1.5 pre-training data.
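The per-source counts listed in this appendix can be tallied to see how they relate to the rounded totals quoted in the text. The sketch below uses only the figures given above; the dictionary and function names are ours. The exact sums come to 102K and 1,246K images, which the text rounds to 100K and 1.2M.

```python
# Per-source image counts as listed in this appendix.
SHAREGPT4V_SOURCES = {
    "COCO": 50_000,
    "LCS (LAION, CC-3M, SBU)": 30_000,
    "SAM": 20_000,
    "TextCaps": 500,
    "WikiArt": 500,
    "Web-crawled (landmarks, celebrities)": 1_000,
}

SHAREGPT4V_PT_SOURCES = {
    "COCO": 118_000,
    "SAM": 570_000,
    "LLaVA-1.5 pre-training data": 558_000,
}


def composition(sources):
    """Return the total count and each source's share of that total."""
    total = sum(sources.values())
    shares = {name: count / total for name, count in sources.items()}
    return total, shares


for label, sources in [("ShareGPT4V", SHAREGPT4V_SOURCES),
                       ("ShareGPT4V-PT", SHAREGPT4V_PT_SOURCES)]:
    total, shares = composition(sources)
    print(f"{label}: {total:,} images")
    for name, share in shares.items():
        print(f"  {name}: {share:.1%}")
```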

Appendix B Caption Analysis

Figure 7 provides a visualization of the root noun-verb pairs for the captions generated by both GPT4-Vision and Share-Captioner. The visualization shows that the diversity and linguistic expression of the captions produced by Share-Captioner are comparable to those of GPT4-Vision.

We analyzed the lexical composition of the captions produced by GPT4-Vision and Share-Captioner, and the results are presented in Table 7. The analysis reveals that the captions generated by our Share-Captioner contain a comparable amount of information to those generated by GPT4-Vision.

Appendix C Prompts

Given the diversity of our image sources, we expect a highly content-related description for each image. As shown in Figure 8, we designed a base prompt for a general description and added a specialized prompt for each data source.