MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

Introduction

In recent years, large language models (LLMs) have experienced rapid advancements (Ouyang et al., 2022; OpenAI, 2022; Brown et al., 2020; Scao et al., 2022a; Touvron et al., 2023; Chowdhery et al., 2022; Hoffmann et al., 2022). With exceptional language understanding capabilities, these models can perform a variety of intricate linguistic tasks in a zero-shot manner. Notably, GPT-4, a large-scale multimodal model, has been recently introduced and demonstrated several impressive capabilities of vision-language understanding and generation (OpenAI, 2023). For example, GPT-4 can produce detailed and accurate image descriptions, explain unusual visual phenomena, and even construct websites based on handwritten text instructions.

Although GPT-4 has exhibited remarkable vision-language capabilities, the methods behind its exceptional abilities are still a mystery (OpenAI, 2023). We believe that these impressive skills may stem from the utilization of a more advanced large language model (LLM). LLMs have demonstrated various emergent abilities, as evidenced in GPT-3’s few-shot prompting setup (Brown et al., 2020) and the findings of Wei et al. (2022). Such emergent properties are hard to find in smaller-scale models. It is conjectured that these emergent abilities are also applicable to multi-modal models, which could be the foundation of GPT-4’s impressive visual description capabilities.

To substantiate our hypothesis, we present a novel vision-language model named MiniGPT-4. As its language decoder, it utilizes an advanced large language model (LLM), Vicuna (Chiang et al., 2023), which is built upon LLaMA (Touvron et al., 2023) and reported to achieve 90% of ChatGPT’s quality as per GPT-4’s evaluation. In terms of visual perception, we employ the same pretrained vision components as BLIP-2 (Li et al., 2023), which consist of a ViT-G/14 from EVA-CLIP (Fang et al., 2022) and a Q-Former network. MiniGPT-4 adds a single projection layer to align the encoded visual features with the Vicuna language model and freezes all the other vision and language components. MiniGPT-4 is initially trained for 20k steps using a batch size of 256 on 4 A100 GPUs, leveraging a combined image captioning dataset that includes images from LAION (Schuhmann et al., 2021), Conceptual Captions (Changpinyo et al., 2021; Sharma et al., 2018), and SBU (Ordonez et al., 2011) to align visual features with the Vicuna language model. Nevertheless, merely aligning visual features with the LLM is inadequate to ensure robust, chatbot-like visual conversation capabilities. The presence of underlying noise in raw image-text pairs can lead to subpar language outputs. Therefore, we collect another 3,500 detailed image description pairs to further fine-tune the model with a designed conversational template in order to improve the naturalness of the generated language and its usability.

In our experiments, we discovered that MiniGPT-4 possesses numerous capabilities similar to those demonstrated by GPT-4. For instance, MiniGPT-4 can generate intricate image descriptions, create websites based on handwritten text instructions, and explain unusual visual phenomena. Furthermore, our findings revealed that MiniGPT-4 also has a variety of other intriguing abilities not showcased in the GPT-4 demonstrations. For example, MiniGPT-4 can directly generate detailed cooking recipes from food photos, write stories or poems inspired by images, write advertisements for products in images, identify problems shown in photos and provide corresponding solutions, and retrieve rich facts about people, movies, or art directly from images, among other capabilities. These abilities are absent in previous vision-language models like Kosmos-1 (Huang et al., 2023) and BLIP-2 (Li et al., 2023) that use less powerful language models. This further validates that integrating visual features with an advanced language model is one of the keys to enhancing vision-language models.

We present a summary of our key findings:

Our research reveals with compelling evidence that by aligning visual features with advanced large language models like Vicuna, MiniGPT-4 can achieve advanced vision-language capabilities comparable to those exhibited in the GPT-4 demonstrations.

Our findings suggest that training merely one projection layer can effectively align a pretrained vision encoder with the large language model. Our MiniGPT-4 requires only about 10 hours of training on 4 A100 GPUs.

We discovered that simply aligning visual features with large language models using short image caption pairs is not sufficient for developing a well-performing model and leads to unnatural language generation. Further finetuning with a small set of detailed image description pairs can address this limitation and significantly improve the model’s usability.

Related Works

Large language models have experienced tremendous success in recent years due to the scaling up of training data and an increase in the number of parameters. Early models, such as BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), and T5 (Raffel et al., 2020), laid the foundation for this progress. Subsequently, GPT-3 (Brown et al., 2020), with a massive scale of 175 billion parameters, was introduced, demonstrating significant breakthroughs across numerous language benchmarks. This development inspired the creation of various other large language models, including Megatron-Turing NLG (Smith et al., 2022), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022), BLOOM (Scao et al., 2022b), and LLaMA (Touvron et al., 2023), among others. Wei et al. (2022) further discovered several emergent abilities, which appear exclusively in large models. The emergence of these abilities underscores the importance of scaling up in the development of large language models. Moreover, by aligning the pre-trained large language model GPT-3 with human intent, instructions and human feedback, InstructGPT (Ouyang et al., 2022) and ChatGPT (OpenAI, 2022) enable conversational interactions with humans and can answer a wide range of diverse and complex questions. More recently, several open-sourced models, such as Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023), have been developed based on LLaMA (Touvron et al., 2023) and also exhibit similar performance.

Leveraging Pre-trained LLMs in Vision-Language Tasks. In recent years, the trend of using autoregressive language models as decoders in vision-language tasks has gained significant traction (Chen et al., 2022; Huang et al., 2023; Yang et al., 2022; Tiong et al., 2022; Alayrac et al., 2022; Li et al., 2023; 2022; Driess et al., 2023). This approach takes advantage of cross-modal transfer, allowing knowledge to be shared between language and multimodal domains. Pioneering studies like VisualGPT (Chen et al., 2022) and Frozen (Tsimpoukelli et al., 2021) have demonstrated the benefits of employing a pre-trained language model as a vision-language model decoder. Flamingo (Alayrac et al., 2022) was then developed to align a pre-trained vision encoder and language model using gated cross-attention, and was trained on billions of image-text pairs, showcasing impressive in-context few-shot learning capabilities. Following that, BLIP-2 (Li et al., 2023) was introduced, employing a Flan-T5 (Chung et al., 2022) with a Q-Former to efficiently align visual features with the language model. Most recently, PaLM-E (Driess et al., 2023), featuring 562 billion parameters, has been developed to integrate real-world continuous sensor modalities into an LLM, thereby establishing a connection between real-world perceptions and human languages. GPT-4 (OpenAI, 2023) has also been recently released, showcasing more powerful visual understanding and reasoning abilities after pre-training on a vast collection of aligned image-text data.

LLMs, such as ChatGPT, have proven to be powerful tools in enhancing the performance of vision-language tasks by collaborating with other specialized models. For instance, Visual ChatGPT (Wu et al., 2023) and MM-REACT (Yang et al., 2023) showcase how ChatGPT can act as a coordinator, integrating with diverse visual foundation models and facilitating their collaboration to tackle more complex challenges. ChatCaptioner (Zhu et al., 2023) treats ChatGPT as a questioner, prompting diverse questions for BLIP-2 to answer. Through multi-round conversations, ChatGPT extracts visual information from BLIP-2 and effectively summarizes the image content. Video ChatCaptioner (Chen et al., 2023) extends this approach, applying it to video spatiotemporal understanding. ViperGPT (Surís et al., 2023) demonstrates the potential of combining an LLM with different vision models to address complex visual queries programmatically. In contrast, MiniGPT-4 directly aligns visual information with the language model to accomplish diverse vision-language tasks without the usage of external vision models.

Method

MiniGPT-4 aims to align visual information from a pretrained vision encoder with an advanced large language model (LLM). Specifically, we utilize Vicuna (Chiang et al., 2023) as our language decoder, which is constructed upon LLaMA (Touvron et al., 2023) and can perform a wide range of complex linguistic tasks. For visual perception, we employ the same visual encoder as used in BLIP-2 (Li et al., 2023), a ViT backbone (Fang et al., 2022) coupled with their pre-trained Q-Former. Both language and vision models are open-sourced. We aim to bridge the gap between the visual encoder and the LLM using a linear projection layer, with an overview of our model displayed in Fig.1.

To achieve an effective MiniGPT-4, we propose a two-stage training approach. The initial stage involves pretraining the model on a large collection of aligned image-text pairs to acquire vision-language knowledge. In the second stage, we finetune the pretrained model with a smaller but high-quality image-text dataset with a designed conversational template to enhance generation reliability and usability.

During the initial pretraining stage, the model is designed to acquire vision-language knowledge from a large collection of aligned image-text pairs. We regard the output from the injected projection layer as a soft prompt for the LLM, prompting it to generate the corresponding ground-truth texts.
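The soft-prompt idea can be illustrated with a toy sketch. It is not the released implementation: the dimensions, the random initialization, and the helper names `linear_project` and `build_llm_input` are assumptions made for illustration, and plain Python lists stand in for tensors. The only point is the data flow: visual tokens pass through one linear layer and are then prepended to the text embeddings that the frozen LLM consumes.

```python
import random


def linear_project(tokens, weight, bias):
    """Apply y = W x + b to every visual token (each token is a list of floats)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Prepend the projected visual tokens to the text embeddings as a soft prompt."""
    soft_prompt = linear_project(visual_tokens, weight, bias)
    return soft_prompt + text_embeddings


# Toy sizes chosen only for illustration (not the real model dimensions).
NUM_VISUAL_TOKENS, VISION_DIM, LLM_DIM = 4, 3, 5

rng = random.Random(0)
visual_tokens = [
    [rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(NUM_VISUAL_TOKENS)
]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(2)]

# In this sketch the projection weights are the only trainable parameters;
# the vision encoder and the LLM that would surround them stay frozen.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

llm_input = build_llm_input(visual_tokens, text_embeddings, weight, bias)
print(len(llm_input), len(llm_input[0]))  # 6 5
```

The output sequence has one row per visual token plus one per text token, all in the LLM embedding dimension, which is what lets a frozen decoder treat the image as a prefix of ordinary input embeddings.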

Throughout the entire pretraining process, both the pretrained vision encoder and the LLM remain frozen, with only the linear projection layer being pretrained. We use a combined dataset of Conceptual Caption (Changpinyo et al., 2021; Sharma et al., 2018), SBU (Ordonez et al., 2011) and LAION (Schuhmann et al., 2021) to train our model. Our model undergoes 20,000 training steps with a batch size of 256, covering approximately 5 million image-text pairs. The entire process takes about 10 hours to complete, utilizing 4 A100 (80GB) GPUs.
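As a quick sanity check on the figures above, the following lines recompute the number of image-text pairs seen during the first stage from the stated step count and batch size. The derived GPU-hour and throughput values are simple arithmetic on the reported numbers, not measurements from the paper.

```python
# Reported first-stage settings.
steps = 20_000
batch_size = 256
hours = 10
gpus = 4

# Samples processed: one batch per step.
pairs_seen = steps * batch_size
print(pairs_seen)  # 5120000, i.e. roughly 5 million pairs

# Derived quantities (back-of-the-envelope only).
gpu_hours = hours * gpus
pairs_per_second = pairs_seen / (hours * 3600)
print(gpu_hours)                # 40
print(round(pairs_per_second))  # 142
```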

Issues of the first pretraining stage. Following the first pretraining stage, our MiniGPT-4 possesses a wealth of knowledge and offers reasonable responses to human inquiries. However, we have observed instances where it produces incoherent linguistic outputs, such as repetitive words or sentences, fragmented sentences, or irrelevant content. These issues hinder MiniGPT-4’s ability to engage in a fluent visual conversation with humans.

Similar challenges were encountered with GPT-3. Despite its pretraining on an extensive language dataset, GPT-3 struggles to generate language outputs that are accurately aligned with users’ intentions. Through a process of instruction fine-tuning and reinforcement learning from human feedback, GPT-3 evolves into GPT-3.5 (Ouyang et al., 2022; OpenAI, 2022) and becomes capable of producing more human-friendly outputs. This phenomenon bears a resemblance to the current state of MiniGPT-4 following its initial pretraining stage. As such, it is not surprising that our model may struggle to generate fluent and natural human language outputs at this stage.

Curating a high-quality alignment dataset for the vision-language domain

To achieve greater naturalness in the generated language and enhance the model’s usability, a second-stage alignment process is essential. While in the realm of NLP, instruction fine-tuning datasets (Taori et al., 2023) and conversations (sha, 2023) are easily accessible, no equivalent datasets exist for the vision-language domain. To address this deficiency, we carefully curated a detailed image description dataset, specifically tailored for vision-language alignment purposes. This dataset is subsequently utilized to fine-tune our MiniGPT-4 during the second-stage alignment process.

In the initial phase, we employ the model derived from the first pretraining stage to generate comprehensive descriptions of input images. To enable our model to produce more detailed image descriptions, we designed a prompt that adheres to the conversational format of the Vicuna (Chiang et al., 2023) language model, as shown below. In this prompt, <ImageFeature> represents the visual features produced by the linear projection layer.

###Human: <Img><ImageFeature></Img> Describe this image in detail. Give as many details as possible. Say everything you see. ###Assistant:

To identify incomplete sentences, we examine whether the generated sentence exceeds 80 tokens. If it does not, we incorporate an additional prompt, ###Human: Continue ###Assistant: , prompting our MiniGPT-4 to extend the generation process. By concatenating the outputs from both steps, we can create a more comprehensive image description. This approach enables us to generate image-text pairs with detailed and informative image descriptions. We randomly select 5,000 images from the Conceptual Caption dataset (Changpinyo et al., 2021; Sharma et al., 2018) and use the pretrained model to generate corresponding language descriptions for each image.
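The length check and continuation step described above can be sketched as follows. This is a minimal illustration, not the authors’ script: `generate` is a hypothetical callable standing in for the first-stage model (it is assumed to inject the image features itself), and whitespace splitting is an assumed stand-in for the real tokenizer.

```python
from typing import Callable

DESCRIBE_PROMPT = (
    "###Human: Describe this image in detail. Give as many details as possible. "
    "Say everything you see. ###Assistant: "
)
CONTINUE_PROMPT = "###Human: Continue ###Assistant: "
MIN_TOKENS = 80  # threshold stated in the text


def count_tokens(text: str) -> int:
    """Assumption: whitespace tokens approximate the model's tokenizer."""
    return len(text.split())


def describe(image, generate: Callable[[object, str], str]) -> str:
    """Generate a description; ask the model to continue if it is too short."""
    first = generate(image, DESCRIBE_PROMPT)
    if count_tokens(first) > MIN_TOKENS:
        return first
    # Re-prompt with the conversation so far plus the "Continue" turn.
    history = DESCRIBE_PROMPT + first + " " + CONTINUE_PROMPT
    second = generate(image, history)
    # Concatenate the outputs of both steps into one description.
    return (first + " " + second).strip()
```

Passing the model in as a callable keeps the control flow testable with a stub before wiring it to a real checkpoint.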

Data post-processing

The above automatically generated image descriptions contain noisy or incoherent descriptions, such as repetition of words or sentences, fragmented sentences, or irrelevant content. In order to fix these issues, we employ ChatGPT to mend the descriptions by utilizing the following prompt:

Fix the error in the given paragraph. Remove any repeating sentences, meaningless characters, not English sentences, and so on. Remove unnecessary repetition. Rewrite any incomplete sentences. Return directly the results without explanation. Return directly the input paragraph if it is already correct without explanation.

Upon completing the post-processing stage, we manually verify the correctness of each image description to guarantee its high quality. Specifically, we first identified several frequently shown errors (“I’m sorry I made a mistake…”, or “I apologize for that …”) and then hard-coded rules to automatically filter them out. We also manually refine the generated captions by eliminating redundant words or sentences that ChatGPT fails to detect. Finally, only approximately 3,500 out of 5,000 image-text pairs satisfy our requirement, and these pairs are subsequently utilized for the second-stage alignment process.
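The rule-based filtering step might look like the sketch below. The two phrases come from the text above; the regular expressions, the quote normalization, and the function names are assumptions, since the exact hard-coded rules are not specified.

```python
import re

# Patterns for the two frequently observed error phrases quoted in the text.
APOLOGY_PATTERNS = [
    re.compile(r"i'?m sorry,? i made a mistake", re.IGNORECASE),
    re.compile(r"i apologize for that", re.IGNORECASE),
]


def normalize_quotes(text: str) -> str:
    """Map typographic apostrophes to ASCII so one pattern covers both forms."""
    return text.replace("\u2019", "'")


def is_clean(description: str) -> bool:
    """Return True if the description contains none of the known error phrases."""
    text = normalize_quotes(description)
    return not any(p.search(text) for p in APOLOGY_PATTERNS)


def filter_descriptions(pairs):
    """Keep only (image, description) pairs whose description passes the rules."""
    return [(img, desc) for img, desc in pairs if is_clean(desc)]


pairs = [
    ("a.jpg", "A red bus is parked on a quiet street."),
    ("b.jpg", "I\u2019m sorry I made a mistake in the previous text."),
    ("c.jpg", "I apologize for that error."),
]
print(filter_descriptions(pairs))  # only the a.jpg pair remains
```

Such rules only catch known failure phrases, which is consistent with the manual verification pass described above.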

Second-stage finetuning

During the second stage, we finetune our pretrained model with the curated high-quality image-text pairs. During the finetuning, we use the predefined prompts in the following template:

###Human: <Img><ImageFeature></Img> <Instruction> ###Assistant:

In this prompt, <Instruction> represents a randomly sampled instruction from our predefined instruction set containing variant forms of instructions such as “Describe this image in detail” or “Could you describe the contents of this image for me”. It is important to note that we do not calculate the regression loss for this specific text-image prompt.
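One way to picture the prompt sampling and the exclusion of the prompt from the loss is the sketch below. It is a simplified illustration under stated assumptions: `tokenize` is a hypothetical tokenizer callable, the image placeholder is omitted (in a real pipeline its positions would be masked out as well), and a 0/1 mask stands in for whatever ignore-index mechanism a training framework uses.

```python
import random

# Two of the instruction variants quoted in the text.
INSTRUCTIONS = [
    "Describe this image in detail",
    "Could you describe the contents of this image for me",
]


def build_example(description, tokenize, rng):
    """Return (input_ids, loss_mask) for one finetuning example.

    loss_mask is 0 on prompt tokens (no loss) and 1 on description tokens.
    """
    instruction = rng.choice(INSTRUCTIONS)
    prompt = f"###Human: {instruction} ###Assistant: "
    prompt_ids = tokenize(prompt)
    target_ids = tokenize(description)
    input_ids = prompt_ids + target_ids
    loss_mask = [0] * len(prompt_ids) + [1] * len(target_ids)
    return input_ids, loss_mask


rng = random.Random(0)
input_ids, loss_mask = build_example("A cat sits.", str.split, rng)
print(sum(loss_mask))  # 3 target tokens contribute to the loss
```

Masking the prompt means the model is trained only to produce the description, not to reproduce the instruction it was given.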

As a result, MiniGPT-4 is now capable of producing more natural and reliable language outputs. Furthermore, we observed that this fine-tuning process is remarkably efficient, only requiring a mere 400 training steps with a batch size of 12, which takes around 7 minutes with a single A100 GPU.

Experiments

In the experiment, we aim to showcase the diverse and emergent capabilities of our MiniGPT-4 model through various qualitative examples. These abilities include generating detailed image descriptions, identifying amusing aspects within memes, providing food recipes from photos, writing poems for images, etc. Additionally, we present quantitative results on the task of image captioning.

MiniGPT-4 demonstrates many advanced abilities compared to traditional vision-language models. For example, it can describe images in detail and interpret the humorous aspects of a given meme. Here, we qualitatively compared our model to one of the leading vision-language models, BLIP-2 (Li et al., 2023), with eight distinct examples, each highlighting a different ability.

An example in Fig.3 demonstrates that MiniGPT-4 effectively identifies various elements within the image, such as busy city streets, clock towers, shops, restaurants, motorcycles, people, streetlights, and clouds. In contrast, BLIP-2 can only cover city streets, people, and motorcycles in its image caption generation. Another example presented in Fig.4(a) shows that MiniGPT-4 successfully explains why the meme is humorous. It interprets that the lying dog is feeling the same way as many people do on Monday, which is often considered to be the most dreaded day of the week. In contrast, BLIP-2 only briefly describes the image content and fails to comprehend the amusing aspects of the image.

We also showcase MiniGPT-4’s other distinctive abilities. These include creating advertising promotions based on a given image (Fig.3), retrieving factual information from a movie photograph (Fig.8), generating a food recipe from a food image (Fig.12), diagnosing plant diseases and suggesting treatment plans (Fig.12), creating a website from a hand-written draft (Fig.4(b)), and writing poems inspired by an image (Fig.10). These abilities are absent in traditional vision-language models like BLIP-2 (which uses Flan-T5 XXL (Chung et al., 2022) as its language model), which rely on less powerful LLMs. This contrast indicates that those advanced vision-language abilities only emerge when the visual features are properly aligned with an advanced LLM such as Vicuna (Chiang et al., 2023).

Quantitative analysis

To quantify performance on advanced vision-language tasks, we compiled a small evaluation dataset comprising 4 tasks: meme interpretation with the question “Explain why this meme is funny.”, recipe generation with the question “How should I make something like this?”, advertisement creation with the prompt “Help me draft a professional advertisement for this.”, and poem composition with “Can you craft a beautiful poem about this image?”. In total, we collect 100 diverse images, with 25 images allocated to each task. We asked human evaluators to determine whether the model generation satisfies the request. We compared our results with BLIP-2 (Li et al., 2023) and present the findings in the corresponding results table. In meme interpretation, poem writing, and advertisement creation, BLIP-2 largely struggles to fulfill any requests. For recipe generation, BLIP-2 succeeds in 4 out of 25 cases. In contrast, MiniGPT-4 manages to address the requests in recipes, advertisements, and poem generation in nearly 80% of the instances. Furthermore, MiniGPT-4 correctly comprehends the challenging humor understanding in memes in 8 out of 25 cases.

Image Captioning

We evaluate the performance of MiniGPT-4 on the COCO caption benchmark and compare it with BLIP-2 (Li et al., 2023). Our model’s generated captions typically contain rich visual details. As such, conventional similarity-based image-caption evaluation metrics struggle to provide an accurate evaluation of our model. We therefore evaluate performance by using ChatGPT to check whether the generated captions cover all the information in the ground-truth captions; details can be found in Appx.A.3. Results in the caption-evaluation table show that MiniGPT-4 outperforms BLIP-2 in generating captions that are more closely aligned with the ground-truth visual objects and relationships. With a success rate of 66.2%, MiniGPT-4 is considerably more accurate than BLIP-2, which achieves only 27.5%. Further evaluation on traditional VQA tasks can be found in Appx.A.2.

Analysis on the second-stage finetuning

Using only the model obtained after the first pretraining stage may result in failures, such as repetitive words or sentences, fragmented sentences, or irrelevant content. These issues are largely mitigated by the second-stage finetuning process. This can be observed in Fig.6, where MiniGPT-4 generates incomplete captions before the second-stage finetuning but complete and fluent captions afterwards. In this section, we investigate the importance and effectiveness of the second-stage finetuning approach.

To quantify the impact of second-stage finetuning, we randomly sampled 100 images from the COCO test set and investigated the model performance on two tasks: detailed description generation and poem writing. The prompts used were “Describe the image in detail.” and “Can you write a beautiful poem about this image?”. These tasks were performed by the model both before and after second-stage finetuning. We manually counted the number of failed generations for the model in each stage. The results are presented in Tab.3. Prior to the second-stage finetuning, approximately 1/3 of the generated outputs failed to match ground truth captions or poems. In contrast, the model after second-stage finetuning has fewer than two failure cases out of the 100 test images for both tasks. These experimental results demonstrate that second-stage finetuning yields a significant improvement in the quality of generated outputs. A qualitative example of the model generation before and after the second-stage finetuning is shown in Fig.6.

Can the original BLIP-2 benefit from the second-stage data?

In this study, we finetune BLIP-2 (Li et al., 2023) with our second-stage data in the same way as MiniGPT-4, and check if it can obtain similar advanced abilities as MiniGPT-4. The finetuned BLIP-2 is denoted as BLIP-2 FT. Note that MiniGPT-4 uses the same visual module as BLIP-2, while BLIP-2 uses FlanT5 XXL (Chung et al., 2022) as the language model, which is not as strong as the Vicuna (Chiang et al., 2023) model used in MiniGPT-4. We rely on the same prompts to assess the advanced capabilities of our model. Qualitative results are shown in Fig.4, 14, and 14. We discover that BLIP-2 FT still generates short responses and fails to generalize to advanced tasks like meme explaining and website coding (Fig.4). Our finding suggests that BLIP-2’s relatively weaker language model FlanT5 XXL benefits less from such a small dataset, and highlights the effectiveness of a more advanced LLM in a VLM system.

Second stage with Localized Narratives

The dataset Localized Narratives (Pont-Tuset et al., 2020) is a detailed image description dataset where annotators describe images while simultaneously localizing the corresponding regions. Here, we test the performance of our model by replacing our self-collected dataset in the second-stage with the Localized Narratives dataset. The model is denoted as MiniGPT-4 LocNa. Qualitative results in Fig.4, 14, and 14 show that MiniGPT-4 LocNa can generate long image descriptions (Fig.14). However, the generated outputs have lower quality with monotonous expressions. Besides, MiniGPT-4 LocNa does not generalize as well as the original MiniGPT-4 in other complex tasks like explaining why the meme is funny (Fig.4(a)). The performance gap may be due to the monotonous and repeated image descriptions in Localized Narratives.

Ablation on the architecture designs

To further demonstrate the effectiveness of using one single linear layer to align visual features with the LLM, we conduct experiments with different architecture designs, including (a) removing the Q-Former and directly mapping the ViT’s output to Vicuna’s embedding space (i.e., without Q-Former), (b) using three linear layers instead of one layer, and (c) additionally finetuning the Q-Former in the vision module. All the variants are trained in the same way as the original design. Results on the A-OKVQA (Schwenk et al., 2022) and GQA (Hudson & Manning, 2019) datasets in Tab.5 show that variant (a), MiniGPT-4 w/o Q-Former, has a similar performance to the original design. Qualitative results of this variant in Fig.4, 14, and 14 also show similar advanced skills. This reveals that the Q-Former from BLIP-2 does not play a critical role for advanced skills. Besides, both variants (b) MiniGPT-4 + 3 Layers and (c) MiniGPT-4 + finetuning Q-Former perform slightly worse than the original MiniGPT-4. This indicates that a single projection layer is sufficient to align the vision encoder and the large language model in our limited training data setting.

Limitation analysis

As MiniGPT-4 is built upon LLMs, it inherits their limitations, such as hallucinating nonexistent knowledge. An example in Fig. 6 shows that MiniGPT-4 incorrectly identifies the presence of white tablecloths in the image, despite their absence. Here, we use the metric $\text{CHAIR}_{i}$ (Rohrbach et al., 2018) to gauge the hallucination rate of the generation, with two distinct prompts to control the model generation length:

MiniGPT-4 (long): Please describe this image as detailed as possible.

MiniGPT-4 (short): Please describe the image shortly and precisely, in less than 20 words.
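For readers unfamiliar with the metric, $\text{CHAIR}_{i}$ is the fraction of mentioned object instances that do not appear in the image. The sketch below is a simplified version under stated assumptions: object extraction and synonym mapping are taken as already done, each caption’s objects are treated as a set, and the function name and inputs are illustrative rather than taken from any official implementation.

```python
def chair_i(mentioned_per_caption, present_per_image):
    """Fraction of mentioned objects that are absent from the ground truth.

    mentioned_per_caption: list of object-name lists extracted from captions.
    present_per_image: list of ground-truth object-name lists, same order.
    """
    hallucinated = 0
    mentioned = 0
    for said, truth in zip(mentioned_per_caption, present_per_image):
        said_set = set(said)
        mentioned += len(said_set)
        # Objects named in the caption but not annotated in the image.
        hallucinated += len(said_set - set(truth))
    return hallucinated / mentioned if mentioned else 0.0


captions = [["dog", "table", "tablecloth"], ["cat"]]
ground_truth = [["dog", "table"], ["cat"]]
print(chair_i(captions, ground_truth))  # 0.25
```

In the toy example, one of four mentioned objects (the tablecloth) is not in the image, so the rate is 0.25; longer captions mention more objects and therefore have more opportunities to hallucinate.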

Results in Tab.5 show that longer captions tend to have higher hallucination rates. For example, MiniGPT-4 (long) generates captions averaging 175 words with a higher hallucination rate, while MiniGPT-4 (short) averages 28.8 words with a lower rate. BLIP-2, averaging 6.5 words, hallucinates less but covers fewer objects, as seen in the caption-evaluation results. Hallucination in detailed image descriptions is still an unresolved issue. Using reinforcement learning from AI feedback together with hallucination detection modules may be a potential solution.

Spatial Information Understanding

MiniGPT-4’s visual perception remains limited. It may struggle to differentiate spatial localization. For example, MiniGPT-4 in Fig. 6 fails to identify the location of the windows. This limitation may stem from a lack of aligned image-text data designed for spatial information understanding. Training on datasets such as RefCOCO (Kazemzadeh et al., 2014) or Visual Genome (Krishna et al., 2017) could potentially alleviate this issue.

Discussion

How does MiniGPT-4 obtain these advanced abilities? Many of the advanced vision-language capabilities demonstrated by GPT-4 can be understood as compositional skills rooted in two foundational skills: image understanding and language generation. Take the task of image-based poem writing as an example. Advanced LLMs like ChatGPT and Vicuna can already craft poems based on users’ instructions. If they acquire the ability to understand images, it becomes possible for them to generalize compositionally to image-based poem writing, even without image-poem pairs in their training data.

In the first pretraining stage, MiniGPT-4 learns to understand images by modeling the correlation between images and short image descriptions from image caption datasets. However, the language style in these image caption datasets differs from that of modern LLMs’ generation, which leads to distorted language generation and hinders successful compositional generalization. Therefore, we introduce a second-stage finetuning to restore the language generation ability. After the two-stage training, MiniGPT-4 successfully generalizes to many advanced compositional vision-language abilities like website coding from drafts or meme interpretation, which verifies our assumption. Future research might delve deeper into the mechanism of compositional generalization and seek ways to enhance it. We hope our work, as an early exploration of these vision-based LLM capabilities, will spur further investigations in this domain.

References

Appendix A

A.2 Evaluation in traditional VQA benchmarks

The aim of this study is to replicate the remarkable multi-modal capabilities demonstrated in GPT-4, such as generating detailed image descriptions and creating websites from hand-drawn drafts. To emphasize the most crucial component of advanced vision-language skills, the methodology of MiniGPT-4 is intentionally kept minimal. For instance, the learnable model capacity is limited (only one linear layer), and MiniGPT-4 is trained with just 5 million pairs, in contrast to BLIP-2 with 129 million image-text pairs. Such a pared-down approach is anticipated to yield suboptimal results on traditional benchmarks. While this isn’t our primary goal, we offer a quantitative analysis of the VQA datasets A-OKVQA (multi-choice) (Schwenk et al., 2022) and GQA (Hudson & Manning, 2019). Additionally, to showcase the potential of MiniGPT-4 with traditional benchmarks, we conduct a straightforward ablation study. Here, we simply unfreeze the LLM using LoRA (Hu et al., 2021) and incorporate more training data from the VQAv2, OKVQA, and A-OKVQA datasets during the second finetuning stage. Results in Tab. 6 indicate that the original MiniGPT-4 lags behind BLIP-2 by a reasonable margin, and merely augmenting the learning capacity and the training data results in a substantial performance improvement, which confirms our expectations. We believe our model’s performance on conventional vision benchmarks can be enhanced with a carefully designed training strategy (e.g., dataset sample ratios, learning rate schedule, etc.), more training data/datasets, and additional learnable parameters. Since enhancing performance on traditional vision benchmarks isn’t this project’s objective, we reserve this aspect for future research.

A.3 Details of Caption Evaluation

We employ ChatGPT to determine whether the baseline models cover all the objects and visual relations presented in the ground-truth captions. For the COCO evaluation dataset, we randomly choose one ground-truth caption and treat it as the reference caption. We apply the following prompt to perform the evaluation.

There is one image caption1 ‘{ground-truth caption}’, and there is another image caption2 ‘{comparison caption}’. Does image caption2 cover all the objects and visual relations shown in image caption1? Only answer yes or no without any explanation.
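The evaluation loop implied by this prompt can be sketched as follows. The prompt text is taken from above; `ask_llm` is a hypothetical callable that sends a prompt to ChatGPT (or any other judge model) and returns its reply, and the yes/no parsing rule is an assumption.

```python
from typing import Callable, Iterable, Tuple

PROMPT_TEMPLATE = (
    "There is one image caption1 ‘{reference}’, and there is another image "
    "caption2 ‘{candidate}’. Does image caption2 cover all the objects and "
    "visual relations shown in image caption1? Only answer yes or no without "
    "any explanation."
)


def covers(reference: str, candidate: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask the judge whether the candidate caption covers the reference caption."""
    prompt = PROMPT_TEMPLATE.format(reference=reference, candidate=candidate)
    answer = ask_llm(prompt)
    # Assumption: any reply starting with "yes" (case-insensitive) is a success.
    return answer.strip().lower().startswith("yes")


def coverage_rate(
    pairs: Iterable[Tuple[str, str]], ask_llm: Callable[[str], str]
) -> float:
    """Fraction of (reference, candidate) pairs judged as covered."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(covers(ref, cand, ask_llm) for ref, cand in pairs)
    return hits / len(pairs)
```

A success rate such as the ones reported for the COCO evaluation would then correspond to `coverage_rate` over all sampled reference and generated caption pairs.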