Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai

Introduction

The field of large multimodal models (LMMs) is advancing quickly because of their skill in handling different types of data, like images and text. Their success in various tasks, including image captioning and visual question answering, is attracting attention in the academic community.

Training LMMs benefits greatly from high-resolution images, because higher resolution allows these models to detect more nuanced visual details, leading to more accurate recognition of objects, their interrelationships, and the broader context within the image. Additionally, the improved visual clarity of high-resolution images aids in capturing and representing the complex details essential for detailed captioning. Despite these advancements, handling the wide range of image resolutions and training data quality is still challenging, especially in complex situations. Solutions such as using pre-trained visual modules with larger input resolution (like LLaVA1.5) and gradually increasing the resolution during training through curriculum learning (like Qwen-VL, PaLI-3, and PaLI-X) have been explored, but they demand significant training resources and still struggle with larger image sizes. To fully leverage the benefits of large input resolution, it is crucial to have more detailed image descriptions, which can enhance the understanding of image-text relationships. However, the short captions in widely used datasets such as COYO and LAION are usually insufficient for this purpose.

We introduce Monkey, a resource-efficient approach to increase input resolution within the Large Multimodal Model framework. Compared to the approach of directly interpolating the ViT to increase input resolution, Monkey utilizes a new module that divides high-resolution images into smaller patches using a sliding window method. Each patch is processed independently by a static visual encoder, enhanced with LoRA adjustments and a trainable visual resampler. This technique leverages existing LMMs while circumventing the need for extensive pre-training. The key idea is that these encoders are typically trained on smaller resolutions (like 448×448), and retraining them from scratch at a higher resolution is costly. By resizing each patch to the encoder's supported resolution, we maintain the training data distribution for the encoder. Our method, which uses various trainable patches to enhance resolution, shows a clear advantage over traditional interpolation techniques for positional embedding, as demonstrated by our quantitative analysis.

To further leverage the advantage of large resolution, we also propose an automatic multi-level description generation method. This method is designed to produce high-quality, abundant caption data by combining insights from multiple generators. It utilizes the strengths of a diverse array of advanced systems: BLIP2, known for its nuanced image-text understanding; PPOCR, a robust optical character recognition system; GRIT, which excels in granular image-text alignments; SAM, a segmentation model; and ChatGPT, an AI system renowned for its contextual understanding and language generation capabilities. By integrating the unique capabilities of these systems, our method offers a comprehensive and layered approach to caption generation, capturing a wide spectrum of visual details.

We summarize the advantages of Monkey as follows:

Support for resolutions up to 1344×896 without pretraining. By going beyond the usual 448×448 resolution used in LMMs, the higher resolution helps to better identify and understand small or closely grouped objects and dense text.

Contextual associations. We introduce a multi-level description generation method that improves the model’s ability to grasp the relationships among multiple targets and more effectively utilize common knowledge in generating text descriptions.

Performance enhancements on many evaluation datasets. As shown in Fig. 1, we carried out testing across 18 diverse datasets, where Monkey achieves very competitive performance in tasks such as Image Captioning, General Visual Question Answering, Scene Text-centric Visual Question Answering, and Document-oriented Visual Question Answering. In particular, in qualitative evaluations centered on dense text question answering, Monkey shows promising results compared with GPT4V.

Related Work

The Large Multimodal Models (LMMs) field has seen significant progress, particularly in enhancing visual and language processing. Methods like Flamingo and OpenFlamingo have advanced visual representation by integrating a Perceiver Resampler with vision encoders. BLIP2 employs a Q-Former to link the frozen LLM and vision encoder. Unified-IO demonstrates versatility by training across over 80 diverse datasets, widening its domain applicability. PaLM-E adopts a unique approach by treating images and text as “multimodal sentences” to improve visual-language tasks. MiniGPT4 bridges visual modules and LLMs, enhancing multimodal capabilities. InstructBLIP, starting from BLIP2, adds instructional inputs to the Q-Former for task-relevant visual features. MME introduces a benchmark for evaluating LMMs’ perception and cognition.

Additionally, there has been significant progress in leveraging large language models. The LLaVA series, including LLaVA and LLaVA1.5, aligns vision encoders and LLMs for better image-text understanding. mPLUG-Owl focuses on fine-tuning with mixed text and visual-text data. mPLUG-Owl2 introduces shared modules for better modality collaboration. KOSMOS-2 enables visual answers like detection boxes. Shikra specializes in Referential Dialogue and is adept at processing positional inputs and outputs. BLiVA combines task-related and global features for enhanced multimodal task processing. Qwen-VL improves visual module resolution to 448. OtterHD fine-tunes Fuyu-8B with instruction/response pairs, maintaining the original image size during inference.

Despite these advancements, challenges remain in extracting finer image features, as noted in prior work, which indicates the need for ongoing development in the field.

Methods

Fig. 2 illustrates the comprehensive architecture of Monkey. Initially, the input image is segmented into patches. These patches are then processed through a shared Vision Transformer (ViT) equipped with distinct adapters. Subsequently, both local and global features, along with the question, are processed using the shared resampler and the Large Language Model (LLM), resulting in the generation of the desired answers.

Input resolution is crucial for accurately interpreting text and detailed image features. Previous studies have shown the effectiveness of starting with smaller resolutions and progressively advancing to larger ones through curriculum learning. However, this approach can be highly resource-demanding, often necessitating comprehensive pretraining with large-scale data (as seen in Qwen-VL, which supports resolutions up to 448×448). To address these issues and efficiently enhance resolution, we introduce a simple yet more effective technique.

To preserve the overall structural information of the input image, we resize the original image to dimensions $(H_v, W_v)$, maintaining it as a global image. Following this, both the individual patches and the global image are processed through the visual encoder and resampler concurrently. Here, the visual resampler, inspired by Flamingo, is a mechanism that performs two main functions: summarizing visual information and obtaining higher semantic visual representations in a language feature space. It achieves this by leveraging a cross-attention module. The module employs trainable vectors (embeddings) as query vectors, along with image features from the visual encoder serving as keys for cross-attention operations.
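The resampler's cross-attention step can be illustrated with a minimal sketch. It is a toy, single-head version written under several assumptions that are not stated in the text: there are no learned projection matrices, the image features act as both keys and values, and the sizes (12 image tokens, 3 queries, dimension 4) are chosen only to keep the example small. The function names `matmul`, `softmax`, and `resample` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def resample(image_features, queries):
    """Toy cross-attention: each trainable query summarizes all image tokens.

    Assumption: image features serve as both keys and values, with no
    learned projections. Returns (summaries, attention_weights).
    """
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*image_features)]  # (dim, num_tokens)
    scores = matmul(queries, keys_t)                       # (num_queries, num_tokens)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, image_features), weights


random.seed(0)
features = [[random.gauss(0, 1) for _ in range(4)] for _ in range(12)]
queries = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
summary, weights = resample(features, queries)
print(len(summary), len(summary[0]))  # number of queries, feature dimension
```

The output has one row per query regardless of how many image tokens go in, which is why a fixed set of queries yields a fixed-length visual representation for every crop.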

This approach strikes a balance between detailed and holistic perspectives of the images, thereby enhancing the model performance while avoiding a substantial increase in computational demand.
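As an illustration of how a high-resolution image can be split into local views, the sketch below computes sliding-window crop boxes. It assumes a non-overlapping window whose stride equals the window size and an image size that is an exact multiple of the window; neither detail is specified above, and the function name `sliding_window_boxes` is hypothetical.

```python
def sliding_window_boxes(height, width, window=448):
    """Return (left, top, right, bottom) boxes covering the image.

    Assumptions: stride equals the window size (no overlap), and the image
    height and width are exact multiples of the window size.
    """
    if height % window or width % window:
        raise ValueError("this sketch assumes sizes that are multiples of the window")
    return [
        (left, top, left + window, top + window)
        for top in range(0, height, window)
        for left in range(0, width, window)
    ]


boxes_896 = sliding_window_boxes(896, 896)
boxes_1344 = sliding_window_boxes(896, 1344)
print(len(boxes_896), len(boxes_1344))
```

Under these assumptions an 896×896 input yields four local patches and a 1344×896 input yields six, each of which would be encoded alongside the resized global image.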

3.2 Multi-level Description Generation

Previous models such as LLaVA and Qwen-VL used large datasets like LAION, COYO, and CC3M for their initial training. However, these datasets often offer image-text pairs that are too simple (e.g., one short sentence to describe a complicated image) and lack detailed descriptions of the imagery. As a result, even when these models are trained with high-resolution images, they struggle to accurately link visual features with basic captions. This limits the models' ability to effectively combine visual processing with language understanding.

To bridge this gap, we develop a novel approach for generating multi-level descriptions automatically. This technique is designed to create rich and high-quality caption data by effectively blending the outputs from various generators. We utilize a combination of several advanced systems, each bringing its own strength to the process: BLIP2, which provides a deep understanding of the relationship between images and text; PPOCR, a strong performer in optical character recognition; GRIT, specializing in detailed image-text matching; SAM, focused on segmentation; and ChatGPT, known for its exceptional ability in contextual language generation.

As shown in Fig. 3, the image description process begins with BLIP2 creating overall captions using a Q-former for tight integration with the vision encoder and LLM, while retaining original CC3M annotations for context. Next, GRIT, a region-to-text model, generates detailed descriptions of specific regions, objects, and their characteristics. PPOCR extracts text from the images, and SAM segments and identifies objects and their parts. These objects are then individually described by BLIP2. However, to counter potential inaccuracies from these tools, especially in zero-shot settings, we find it essential to further use BLIP2 to check for consistency between image areas, objects, and their descriptions, filtering out low-scoring matches. Finally, all data, including global captions, localized descriptions, text extracts, and object details with spatial coordinates, are fed into the ChatGPT API for fine-tuning, enabling ChatGPT to generate accurate and contextually rich image descriptions.

By merging the unique features of these systems, our approach achieves a layered and comprehensive style of caption creation. It captures an extensive range of visual and textual nuances, resulting in captions that are not just elaborate, but also contextually diverse and engaging.

3.3 Multi-task Training

Our goal is to train a model that is both cost-effective and capable of understanding different types of images for various tasks. By integrating various datasets and employing uniform instructions for all tasks, following prior work, we enhance the model’s learning ability and training efficiency.

We focus on tasks such as creating image captions, responding to image-based questions, and other activities requiring the model to process both text and images. For captioning, we instruct the model with “Generate the caption in English:” for basic captions, and “Generate the detailed caption in English:” for more intricate ones. When it comes to answering questions about images, we use a straightforward format: “{question} Answer: {answer}.”
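The instruction formats quoted above can be sketched as small helpers. The two caption prompts and the question-answer template are taken from the text; the function names, the choice to treat the trailing period in the quoted template as sentence punctuation, and the convention of leaving the answer empty at inference time are assumptions made for illustration.

```python
CAPTION_PROMPT = "Generate the caption in English:"
DETAILED_CAPTION_PROMPT = "Generate the detailed caption in English:"


def caption_prompt(detailed=False):
    """Pick the basic or the detailed captioning instruction."""
    return DETAILED_CAPTION_PROMPT if detailed else CAPTION_PROMPT


def vqa_text(question, answer=""):
    """Format a VQA sample as '{question} Answer: {answer}'.

    Assumption: with an empty answer the string ends at 'Answer:' so the
    model can complete it at inference time.
    """
    return f"{question} Answer: {answer}".rstrip()


print(caption_prompt(detailed=True))
print(vqa_text("What color is the car?", "red"))
```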

In our training process, we use a variety of public datasets tailored to specific tasks. For image captioning, we include both our own detailed captions and established datasets like COCO caption and TextCaps. For general Visual Question Answering (VQA), we utilize datasets such as VQAv2, OKVQA, GQA, ScienceQA, and VizWiz. For Text-centric VQA tasks, we select TextVQA, OCRVQA, and AI2D, while for document-related VQA, we employ datasets like DocVQA, ChartQA, InfoVQA, DeepForm, Kleister Charity (KLC), WikiTableQuestions (WTQ), TableFact, and VisualMRC. To ensure balanced training, we control the image count for each task as detailed in Tab. 1. Our compiled dataset, with around 1.44 million examples, is designed to train our model effectively in understanding and executing various instructions.

Experiment

We evaluate our model by testing it across a spectrum of standard vision-language tasks, including the generation of image descriptions, answering diverse visual questions, and comprehending targeted phrases in images.

Model Configuration. We conduct experiments based on the well-trained ViT-BigG and LLM from Qwen-VL, a pre-trained large multimodal model. Since the vision encoder has already been well pretrained, we proceed directly to the instruction-tuning stage. During instruction tuning, $H_v$ and $W_v$ are set to 448 to match the encoder of Qwen-VL. We employ a consistent resampler across all crops. The learnable queries engage with local features, utilizing the same set of 256 learnable queries for each crop. Due to limitations in training time, our main experiments were conducted using images of size 896×896 unless otherwise specified. For LoRA, we set the rank to 16 for the attention module and 32 for the MLP in the encoder. Monkey includes 7.7B parameters for the large language model, 90M parameters for the resampling module, 1.9B parameters for the encoder, and 117M parameters for LoRA. The overall parameter count for Monkey is 9.8B.
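The parameter budget above, and the number of visual tokens implied by 256 queries per crop, can be checked with a short calculation. The component sizes come from the text. The token count rests on assumptions: a non-overlapping 448×448 window, and the global image also being summarized into 256 tokens. The names `PARAMS_B` and `visual_tokens` are hypothetical.

```python
# Component sizes in billions of parameters, as reported in the text.
PARAMS_B = {"llm": 7.7, "resampler": 0.09, "encoder": 1.9, "lora": 0.117}
QUERIES_PER_CROP = 256


def total_params_b():
    """Sum the reported component sizes (in billions)."""
    return sum(PARAMS_B.values())


def visual_tokens(height, width, window=448, include_global=True):
    """Visual tokens passed to the LLM.

    Assumptions: non-overlapping windows, and the global image is also
    resampled with the same 256 queries.
    """
    patches = (height // window) * (width // window)
    return (patches + int(include_global)) * QUERIES_PER_CROP


print(round(total_params_b(), 1))                           # 9.8
print(visual_tokens(896, 896), visual_tokens(1344, 896))    # 1280 1792
```

Under these assumptions the visual sequence grows linearly with the number of patches, which is one way to see why the input length of the language model bounds the usable resolution.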

Training. We use our multi-level description generation method to regenerate around 427k image-text pairs from the CC3M dataset, previously used in LLaVA’s pretraining. During the training process, we utilize the AdamW optimizer with a learning rate of 1e-5 and the cosine learning rate schedule. Additionally, we set the values of $\beta_1$ and $\beta_2$ to 0.9 and 0.95, respectively. We incorporate a warmup period of 100 steps and employ a batch size of 1024. To control overfitting, we apply a weight decay of 0.1. The whole training process takes 40 A800 days for one epoch.
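The schedule described above can be sketched as a plain function. The peak learning rate, warmup length, batch size, and approximate dataset size are taken from the text. The linear shape of the warmup, decay to zero, and a single-epoch step count derived from about 1.44 million examples are assumptions; the function name `learning_rate` is hypothetical.

```python
import math

# Optimizer settings reported in the text (AdamW).
OPTIMIZER_CONFIG = {"lr": 1e-5, "betas": (0.9, 0.95), "weight_decay": 0.1}


def learning_rate(step, total_steps, peak_lr=1e-5, warmup_steps=100, min_lr=0.0):
    """Warmup followed by cosine decay.

    Assumptions: the warmup is linear and the cosine decays to min_lr = 0.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed step count for one epoch: about 1.44M examples at batch size 1024.
steps_per_epoch = math.ceil(1_440_000 / 1024)
schedule = [learning_rate(s, steps_per_epoch) for s in range(steps_per_epoch + 1)]
print(steps_per_epoch, max(schedule), schedule[-1])
```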

4.2 Results

We report the results on Image Caption, General VQA, Scene Text-centric VQA, and Document-oriented VQA. We also conduct testing on the MME benchmark and achieve a perception score of 1505.3, ranking second, as shown in Fig. 1. The details of each dataset can be found in Appendix A.

Image Caption. Image captioning is vital for connecting visual content with the understanding of natural language. In our study, we select Flickr30K and TextCaps as the benchmarks for testing the image captioning task. TextCaps challenges the model to interpret and reason about text within images effectively. We present our model’s performance on Flickr30K and TextCaps in Tab. 2, where the results indicate that Monkey demonstrates enhanced performance on these datasets. We also qualitatively show the effectiveness of our method in offering detailed image descriptions in Sec. 4.4 and Appendices B and D.

General VQA. General visual question answering (VQA) requires the ability to learn from both visual and textual information and a deep understanding of how they interrelate. For General VQA, we validate on five benchmarks: VQAv2, OKVQA, GQA, ScienceQA, and VizWiz. The performance results are shown in Tab. 2. Our model shows remarkable proficiency in VQAv2, OKVQA, ScienceQA, and VizWiz, surpassing the nearest competing method by an average of 1.62%. These results highlight the effectiveness of our method, emphasizing its use of high input resolution and detailed data.

Scene Text-centric VQA. Text information is commonly found in real-world scenes, making the ability to answer questions about text in images a crucial aspect of question-answering tasks. For our evaluation, we employ four datasets: TextVQA, AI2D, STVQA, and ESTVQA. The results, shown in Tab. 3, indicate that our model leads in performance on these datasets, outperforming the nearest competitor by an average of 4.35%. Based on our observation, this enhanced performance is mainly attributed to the increased image resolution, which brings smaller text and finer details into clearer view. Moreover, the inclusion of detailed caption data during training provides valuable textual context, further boosting the robustness of the model.

Document-oriented VQA. Despite the clean backgrounds of documents, their densely packed text poses distinct challenges. To effectively evaluate our model, we select representative benchmarks including DocVQA, ChartQA, InfographicVQA, DeepForm, KLC, and WTQ. The results, as detailed in Tab. 4, show that Monkey surpasses Qwen-VL in most Document-oriented VQA tasks, achieving a significant average improvement of 9.77%. The higher resolution of documents reveals more intricate details and a denser concentration of information. Monkey’s capability to process larger input resolutions enhances its spatial perception, thereby improving its recognition and comprehension of various document elements like text, charts, infographics, and forms.

4.3 Ablation Study

We conduct thorough experiments to validate the effectiveness of our designs.

Ablation study on strategies of enhancing input resolution. We first evaluate the existing technique of improving input resolution, as illustrated in Tab. 5. Resizing the visual encoder to a size of 896 using traditional position interpolation results in worse performance compared with our method under the same settings (r1 vs. r9). Interestingly, applying LoRA to the encoder for this traditional interpolation method appears to be less effective than not using it (r1 vs. r2). This may be because the parameters inherited from the previous encoder are specifically tuned for a lower resolution, so forcibly changing the resolution may necessitate more training resources.

For our method (r3-r9), as we increase the input size, there is a noticeable boost in performance, especially on the DeepForm dataset. It can be observed that adding LoRA does not significantly increase FLOPs, and the use of one LoRA or four LoRAs results in a minimal difference in throughput (r7-r9). The model’s ability to discern intricate details and sharper images enhances its understanding of visual aspects such as objects, shapes, and textures, thereby improving its overall visual perception. When we further push the input resolution to 1344×896, which is the highest resolution the device can support, the model shows further improvements on high-resolution datasets like DeepForm, InfoVQA, and WTQ, as detailed in Tab. 5. However, for some datasets, such as TextVQA, using the largest resolution results in a slight decline in performance. Since the original average resolution in the TextVQA dataset is around 950 pixels in width and 811 pixels in height, further increasing the input resolution seems unnecessary for these images.

Furthermore, as shown in Tab. 6, we consistently demonstrate the effectiveness of our method on LLaVA1.5. Impressively, we noticed significant improvements when we increased the input resolution from 224 to 448, demonstrating the efficiency of our approach.

Trainable Adapters. As shown in Tab. 5, reducing the number of LoRA modules causes a performance decrease. Using one LoRA for all patches, compared to not using LoRA, provides a better perception of local details (r7 vs. r8), with an especially significant improvement in STVQA. Utilizing four LoRA modules leads to better performance, which may be because this approach enables the model to learn a better understanding of the spatial relationships and contextual information within distinct image regions.

Collaboration between High Resolution and Multi-level Description. To validate the collaboration between High Resolution and Multi-level Description, we conduct ablation studies on LLaVA1.5. We employ a ViT-L as our vision encoder and Vicuna13B as the language model. By replacing the original annotations from CC3M with our generated annotations in the pretraining, we consistently achieve better results on GQA, TextVQA, and MMVet, as demonstrated in Tab. 6. Furthermore, we observe that detailed descriptions consistently yield greater performance enhancements at resolutions of 336 and 448, compared to a resolution of 224. In Appendix E, we provide visualization results for Monkey at different resolutions. These results show that high-resolution models shine when trained with more comprehensive descriptions.

4.4 Visualization

In a side-by-side qualitative analysis, we compared Monkey with GPT4V and other LMMs on a task of generating detailed captions. The results, illustrated in Fig. 4, demonstrate Monkey’s superior capability in providing exhaustive descriptions of images. For instance, in the image from Fig. 4, both Monkey and GPT4V successfully identified an “Emporio Armani” store in the background. Moreover, Monkey went further in detailing various elements in the scene, such as describing “another woman in a red coat and black pants carrying a black purse”.

Additionally, as shown in Fig. 5, we qualitatively observe that in many cases involving complex text-based inquiries, Monkey shows impressive performance when compared to GPT4V. More visualization results of Monkey can be found in the Appendix.

4.5 Limitation

The capability of our method to process input images is constrained to a maximum of six patches due to the limited input length of the language model. This restriction hampers the further expansion of input resolution.

Moreover, the multi-level description generation approach is capable of describing only the scene presented in the image, and its scope is bound by the world knowledge encapsulated in BLIP2 and the original CC3M annotations. For instance, when provided with a photo of a location in a country, the method can describe the visual aspects of the scene, but it cannot identify which country the scene is in.

Conclusion

This paper proposes a training-efficient approach to effectively improve the input resolution capacity up to 1344×896 pixels without pretraining from the start. To bridge the gap between simple text labels and high input resolution, we propose a multi-level description generation method, which automatically provides rich information that can guide the model to learn the contextual association between scenes and objects. With the synergy of these two designs, our model achieved excellent results on multiple benchmarks. By comparing our model with various LMMs, including GPT4V, our model demonstrates promising performance in image captioning by paying attention to textual information and capturing fine details within the images; its improved input resolution also enables remarkable performance in document images with dense text.

Acknowledgement

This research is supported by NSFC (No. 62225603).

Appendix A Summary of the Evaluation Benchmarks

We present a comprehensive overview of the evaluation benchmarks utilized, along with their corresponding metrics, in Tab. 7. For the Image Caption task, we selected two datasets: Flickr30K, which is an image caption dataset for natural images, and TextCaps, which is an image caption dataset for natural images with text. For general Visual Question Answering (VQA), we chose five commonly used datasets. VQAv2 is an open-ended VQA dataset focused on natural images, while OKVQA requires additional world knowledge. GQA is a dataset designed for real-world visual reasoning and compositional question answering. ScienceQA involves multimodal multiple-choice VQA on science topics, and VizWiz aims to answer questions from blind individuals. In the domain of Scene Text-centric VQA, our selection includes TextVQA, AI2Diagram (AI2D), STVQA, and ESTVQA. AI2D is a multiple-choice VQA dataset that focuses on science diagrams, while the others involve reading and reasoning about text in natural images. For the STVQA and ESTVQA datasets, we followed the split provided by prior work. Regarding Doc-oriented VQA, we cover various document images, including documents, charts, infographics, reports, and HTML tables. In the case of DeepForm and KLC, we transform the Key Information Extraction task into a Visual Question Answering (VQA) task. Additionally, we evaluate Monkey on the MME benchmark, which measures perception and cognition abilities. Furthermore, for the ablation study on LLaVA1.5, we adhere to the evaluation settings specified by LLaVA1.5.

Appendix B More Visualization Results

We present additional visualization results, where Fig. 6 demonstrates Monkey’s capabilities in various VQA tasks. Monkey analyzes the question, identifies the key elements in the image relevant to answering the question, and exhibits the ability to perceive even minute text within the image. Moreover, Monkey can reason about the objects present in the scene and possesses a strong understanding of visual charts. In addition, Fig. 6 showcases Monkey’s impressive captioning ability, accurately describing various objects in the image and providing appropriate summaries.

Appendix C More Examples of our Generated Data

In Fig. 7, we present the detailed captions generated by our method. Compared to the original annotations from CC3M, our generated descriptions cover many more details of the image.

Appendix D Comparison with other LMMs.

The comparison results of the VQA task in Fig. 8 indicate that after applying our method of scaling up the model size, Monkey has achieved significant performance advantages in tasks related to dense text. It not only surpasses the performance of QwenVL-Chat , LLaVA-1.5 , and mPLUG-Owl2 but also achieves promising results compared to GPT-4V in tasks related to dense text. This clearly demonstrates the importance of scaling up the model size for performance improvement in multimodal large models. It further validates the effectiveness of our method in enhancing the performance of multimodal large models.

In Fig. 9, the comparison between Monkey and GPT-4V, QwenVL-Chat, LLaVA-1.5, and mPLUG-Owl2 on Detailed Caption task is shown. It can be observed that Monkey accurately describes the content of the image and exhibits high sensitivity to the text within the image. It provides detailed descriptions of the image while ensuring accuracy.

Appendix E Visualization results for models at different resolutions.

In Fig. 10, we performed VQA tasks testing at three different resolutions: 896, 784, and 672. The visual results obtained further validate the importance of our size expansion method for improving the performance of LMMs. While using a resolution of 896 for VQA tasks testing yielded correct results, using resolutions of 784 and 672 resulted in errors, with the smallest size of 672 showing more errors.

In Fig. 11, we conducted tests at three different resolutions: 896, 784, and 672. It can be observed that as the resolution decreases, the details in the images gradually become less visible to the model.

Appendix F Data Generation.

Hyperparameter Control in Data Generation Pipeline. The appropriate selection of hyperparameters is crucial. We empirically selected them based on qualitative results, finding SAM’s default threshold and a 0.5 Image-Text Matching Score to be effective. We conducted a quantitative validation on 80 samples using the GPT-4V evaluation. The results shown in Tab. 8 reveal that SAM’s threshold is relatively robust, and the 0.5 threshold for Image-Text Matching Score offers a better performance.
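The filtering step controlled by the Image-Text Matching Score can be sketched as follows. The 0.5 threshold comes from the text. The record layout, the assumption that scores lie in [0, 1], the inclusive comparison, and the example values are all hypothetical, and the scores here are hard-coded rather than produced by an actual matching model.

```python
from dataclasses import dataclass


@dataclass
class RegionDescription:
    box: tuple          # (left, top, right, bottom), hypothetical layout
    text: str           # description generated for the region
    match_score: float  # image-text matching score, assumed to lie in [0, 1]


def filter_descriptions(regions, threshold=0.5):
    """Keep only region descriptions whose matching score reaches the threshold.

    Assumption: the comparison is inclusive (score >= threshold).
    """
    return [r for r in regions if r.match_score >= threshold]


# Made-up example records for illustration only.
candidates = [
    RegionDescription((0, 0, 100, 80), "a red coat", 0.91),
    RegionDescription((40, 10, 90, 60), "a blue bicycle", 0.32),
    RegionDescription((120, 30, 200, 150), "a store sign", 0.50),
]
kept = filter_descriptions(candidates)
print([r.text for r in kept])
```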

Comparison with LLaVA’s GPT4 Method. While the GPT4 method in LLaVA depends on manually annotated captions from the COCO dataset as the basis for data generation, our approach focuses on generating original, detailed captions autonomously. Additionally, our detectors are skilled in revealing a spectrum of details in images, from text to nuanced object characteristics, which enables us to enrich unlabeled data by extracting complex, multi-level details, paving the way for the creation of both cost-effective and accurate image descriptions.

Why choose BLIP2? We found that the performance is very similar in the GPT-4V evaluation when utilizing brief descriptions of local areas from other VLMs, as shown in Tab. 9. However, for generating approximately 5M descriptions, using BLIP2 takes approximately 3 days, while LLaVA and mPLUG-Owl require about 21 days and 32 days, respectively. For the sake of saving time, we choose BLIP2.

Appendix G Ablation study on Global Feature.

We conducted experiments on the presence or absence of global features at a resolution of 896. By adding global features, the results showed a 7.5% performance gain on TextVQA, a 0.6% performance gain on GQA, and a 6.2% performance gain on DocVQA. This demonstrated that global features contribute to enhancing the overall performance.