How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

cs.CV

Introduction

Large language models (LLMs) have been instrumental in advancing artificial general intelligence (AGI) systems, demonstrating remarkable abilities in processing open-world language tasks. Leveraging the advancements in LLMs, multimodal large language models (MLLMs) have made significant strides, facilitating complex vision-language dialogues and interactions that bridge the gap between textual and visual information. Despite these achievements, there remains a noticeable divide between the capabilities of open-source models and proprietary commercial models, e.g., GPT-4V, the Gemini series, and Qwen-VL-Max.

This gap is mainly reflected in the following three aspects: (1) Parameter Scale: Recent proprietary commercial MLLMs typically have no fewer than 100 billion parameters, while open-source models commonly pair a 300 million parameter vision foundation model (VFM) with an LLM of either 7 billion or 13 billion parameters. (2) Image Resolution: Proprietary commercial models typically employ a dynamic resolution approach, preserving the original aspect ratio to facilitate detailed scene and document understanding. In contrast, open-source models generally train with fixed resolutions, such as 336 $\times$ 336 and 448 $\times$ 448, leading to a considerable gap in capabilities relative to commercial counterparts. (3) Multilingual Capability: Proprietary models often leverage extensive multilingual datasets for training, enhancing their performance across diverse languages. However, open-source models predominantly utilize English data and rely on the zero-shot capabilities of LLMs for other languages, e.g., LLaVA-NeXT. This results in sub-optimal performance in non-English scene understanding and OCR tasks.

To bridge the gap, we introduce InternVL 1.5, integrating three major improvements to enhance its performance and usability. (1) We apply a continuous learning strategy to a large-scale VFM, InternViT-6B, refining it using high-quality image-text data. This process not only enhances the model’s ability to understand visual content but also improves its adaptability across various LLMs. In addition, using InternLM2-20B as the language foundation model offers robust initial language processing capabilities. (2) We adopt a dynamic high-resolution strategy that segments images into 448 $\times$ 448 tiles, with the number of tiles ranging from 1 to 40 (i.e., 4K resolution) based on the aspect ratio and resolution of the images. To capture global context, we additionally include a thumbnail view. (3) We gather a diverse collection of public datasets, covering high-quality natural scenes, charts, documents, and conversations in both English and Chinese. Additionally, we develop a data translation pipeline using open-source LLMs, which can be easily extended to more languages.

These designs endow our model with several advantages: (1) Flexible Resolution: Similar to the “low” or “high” modes available in GPT-4V , InternVL 1.5 enables users to select the optimal resolution for their images, such as using low-resolution for scene subject description and high-resolution (up to 4K resolution) for document understanding, effectively balancing computational efficiency with detail preservation. (2) Bilingual Proficiency: InternVL 1.5 exhibits robust bilingual capabilities, proficiently handling multimodal perception and understanding tasks in both English and Chinese. Notably, in tasks related to Chinese, our model generally outperforms the leading commercial model GPT-4V . (3) Strong Visual Representation: By implementing a continuous learning strategy, we enhance the visual representation capabilities of InternViT-6B , making it robust to flexible input resolution and various visual domains. Benefitting from InternViT-6B’s massive parameters, our model achieves a level of visual representation that rivals the linguistic capabilities of LLMs with more than 20 billion parameters. This synergy between visual and linguistic processing endows our system with robust multimodal capabilities.

We evaluated InternVL 1.5 on 18 representative multimodal benchmarks, which are categorized into four specific groups: OCR-related, general multimodal, mathematical, and multi-turn conversation benchmarks. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Notably, as shown in Figure 1, it even surpasses leading proprietary models like Grok-1.5V , GPT-4V , Claude-3 Opus , and Gemini Pro 1.5 in four specific benchmarks, particularly in OCR-related datasets such as TextVQA , ChartQA , and DocVQA . This evaluation indicates that InternVL 1.5 has effectively narrowed the gap between open-source models and leading commercial models. We hope that our approach and open-source model weights can contribute to the development of the MLLM community.

Related Work

Large language models (LLMs) have greatly advanced AGI by enabling complex language tasks previously thought human-exclusive. Building on this, the development of proprietary commercial MLLMs represents a significant evolution. For example, OpenAI’s GPT-4V extends GPT-4’s capabilities by incorporating visual inputs, allowing it to handle both text and image content, which stands as a significant development in the domain of MLLMs. Afterward, Google’s Gemini series progresses from Gemini 1.0 to Gemini 1.5 , enhancing MLLMs with the ability to process text, images, and audio and support up to 1 million tokens, which boosts performance significantly. The Qwen-VL-Plus/Max are Alibaba’s leading models in the Qwen-VL series , renowned for superior capacity in multimodal tasks without needing OCR tools. Recent advancements in proprietary MLLMs include Anthropic’s Claude-3V series , HyperGAI’s HPT Pro , Apple’s MM1 , StepFun’s Step-1V , and xAI’s Grok-1.5V .

2.2 Open-Source MLLMs

The development of open-source MLLMs has significantly influenced the AGI landscape by integrating and enhancing capabilities in processing both visual and textual data. Over the past year, many open-source MLLMs have become well-known, including the LLaVA series, MiniGPT-4, VisionLLM, Qwen-VL, CogVLM, Shikra, and others. However, these models are typically trained on images with small, fixed resolutions such as 336 $\times$ 336 or 448 $\times$ 448, which leads to sub-optimal performance on images with unusual aspect ratios or document data. To address this issue, many approaches have been explored for training on high-resolution images. Currently, there are two common technical routes: one involves designing a dual-branch image encoder, and the other involves dividing a high-resolution image into many low-resolution tiles. Despite these explorations in high-resolution training, these open-source models still exhibit significant gaps in understanding documents, charts, and infographics, as well as recognizing scene texts, compared to leading commercial models.

2.3 Vision Foundation Models for MLLMs

Vision foundation models (VFMs) are a focal point of research within the MLLM community. Currently, models like CLIP-ViT and SigLIP are prevalently utilized; however, many studies have been conducted to find the most suitable vision encoders for MLLMs . For instance, Tong et al. observed notable differences in the visual patterns of CLIP and DINOv2 , leading to the development of a mixture-of-features module that combines these two VFMs. LLaVA-HR introduced a dual-branch vision encoder utilizing CLIP-ViT for low-resolution pathways and CLIP-ConvNext for high-resolution pathways. Similarly, DeepSeek-VL adopted a dual vision encoder design, using SigLIP-L for low-resolution images and SAM-B for high-resolution images. In this report, we propose a continuous learning strategy for our vision foundation model—InternViT-6B , which continuously boosts the visual understanding capabilities and can be transferred and reused across different LLMs.

InternVL 1.5

As illustrated in Figure 3, InternVL 1.5 employs an architecture akin to widely-used open-source MLLMs, specifically the “ViT-MLP-LLM” configuration referenced in various existing studies . Our implementation of this architecture integrates a pre-trained InternViT-6B with a pre-trained InternLM2-20B using a randomly initialized MLP projector.

During training, we implemented a dynamic resolution strategy, dividing images into tiles of 448 $\times$ 448 pixels, with the number of tiles ranging from 1 to 12 based on the aspect ratio and resolution of the input images. During testing, this can be scaled up zero-shot to 40 tiles (i.e., 4K resolution). To enhance scalability for high resolution, we simply employed a pixel shuffle operation to reduce the number of visual tokens to one-quarter of the original. Therefore, in our model, a 448 $\times$ 448 image is represented by 256 visual tokens.
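The token arithmetic behind this reduction can be sketched as follows. This is a minimal illustration rather than the released implementation: the patch size of 14 is an assumption not stated in this section, and `pixel_unshuffle` is a hypothetical pure-Python helper that merges each 2 $\times$ 2 block of neighbouring token vectors into one longer vector, which is the space-to-depth form of pixel shuffle.

```python
def tokens_after_pixel_shuffle(image_size=448, patch_size=14, factor=2):
    """Visual tokens per tile after merging factor x factor token blocks.

    patch_size=14 is an assumption; the text only states the final count of 256.
    """
    side = image_size // patch_size      # 448 / 14 = 32 tokens per side
    return (side // factor) ** 2         # (32 / 2) ** 2 = 256


def pixel_unshuffle(grid, factor=2):
    """Hypothetical helper: merge each factor x factor block of token vectors.

    `grid` is a list of rows, each row a list of feature vectors (lists).
    The output has 1/factor**2 as many tokens, each factor**2 times longer.
    """
    height, width = len(grid), len(grid[0])
    merged_grid = []
    for i in range(0, height, factor):
        merged_row = []
        for j in range(0, width, factor):
            merged = []
            for di in range(factor):
                for dj in range(factor):
                    merged.extend(grid[i + di][j + dj])
            merged_row.append(merged)
        merged_grid.append(merged_row)
    return merged_grid


# A toy 4 x 4 grid of 1-dimensional tokens becomes a 2 x 2 grid of 4-dimensional tokens.
toy = [[[r * 4 + c] for c in range(4)] for r in range(4)]
small = pixel_unshuffle(toy)
```

Under the assumed patch size, a tile yields 32 $\times$ 32 = 1,024 tokens before the operation and 256 after it, matching the one-quarter reduction described above.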

3.2 Strong Vision Encoder

In existing MLLMs , the most commonly used vision foundation model is typically a contrastively pre-trained ViT . However, these ViTs are commonly trained on image-text pairs crawled from the Internet at a fixed low resolution (e.g., 224 $\times$ 224), so their performance degrades when tasked with processing high-resolution images or images from sources other than the Internet, such as document images.

InternViT-6B-448px-V1.2. To address this issue, the InternVL 1.2 update involved continuous pre-training of the InternViT-6B model. First, we found that the features from the fourth-to-last layer perform best for multimodal tasks, so we directly discarded the weights of the last three layers, reducing InternViT-6B from 48 layers to 45 layers. Then, we increased the resolution of InternViT-6B from 224 to 448 and integrated it with Nous-Hermes-2-Yi-34B. To equip the model with high-resolution processing and OCR capabilities, both the vision encoder and the MLP were activated for training, utilizing a mix of image captioning and OCR-specific datasets. The newly derived InternViT weights from this process were released as InternViT-6B-448px-V1.2 (https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2).

InternViT-6B-448px-V1.5. The development of InternVL 1.5 continues pre-training from the strong foundation of InternViT-6B-448px-V1.2. In this update, the resolution of training images is expanded from a fixed 448 $\times$ 448 to a dynamic resolution, where the basic tile size is 448 $\times$ 448 and the number of tiles ranges from 1 to 12. Additionally, we enhance the data scale, quality, and diversity of the pre-training dataset, which gives the 1.5 version of the model strong robustness, OCR capability, and high-resolution processing capability (https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5). Details of the dynamic resolution and training datasets are described in Sections 3.3 and 3.4.

It is noteworthy that despite the LLM in InternVL 1.5 being changed from Nous-Hermes-2-Yi-34B to InternLM2-20B , the InternViT maintained excellent compatibility and portability with the new LLM. This suggests that the visual features learned by InternViT-6B during the pre-training stage of MLLMs are broadly applicable and not tightly bound to the specific LLM.

3.3 Dynamic High-Resolution

Inspired by UReader , we adopt a dynamic high-resolution training approach that effectively adapts to the varying resolutions and aspect ratios of input images. This method leverages the flexibility of segmenting images into tiles, enhancing the model’s ability to process detailed visual information while accommodating diverse image resolutions. It mainly consists of the following steps:

Dynamic Aspect Ratio Matching. As shown in Figure 4, to maintain natural aspect ratios during processing, we dynamically match the optimal aspect ratio from a pre-defined set of aspect ratios. Due to limited computational resources, we allow a maximum of 12 tiles during training. Consequently, this set includes all 35 possible combinations of aspect ratios formed by 1 to 12 tiles, such as {1:1, 1:2, 2:1, 3:1, …, 2:6}. During the matching process, for each input image, we calculate its aspect ratio and compare it with the 35 pre-defined aspect ratios by measuring the absolute difference. If multiple pre-defined aspect ratios match (e.g., 1:1 and 2:2), we prioritize the one not exceeding twice the input image’s area, thereby preventing excessive enlargement of low-resolution images.
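The matching step can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the released code: the function names are hypothetical, grids are written as (columns, rows), and the tie-break is interpreted as "among equally close grids, move to a larger one only if its area is at most twice the input image's area".

```python
TILE = 448  # tile side length in pixels, as stated in the text


def candidate_grids(max_tiles=12):
    """All (cols, rows) grids that use between 1 and max_tiles tiles."""
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Smaller grids first, so ties start from the cheapest option.
    return sorted(grids, key=lambda g: g[0] * g[1])


def match_grid(width, height, max_tiles=12):
    """Pick the grid whose aspect ratio is closest to the image's."""
    image_ratio = width / height
    image_area = width * height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(max_tiles):
        diff = abs(image_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff:
            # Tie (e.g. 1:1 vs 2:2): accept the larger grid only if it does
            # not exceed twice the input image's area.
            if cols * rows * TILE * TILE <= 2 * image_area:
                best = (cols, rows)
    return best


num_candidates = len(candidate_grids())   # 35 grids for up to 12 tiles
portrait = match_grid(800, 1300)          # (2, 3), i.e. 896 x 1344 after resizing
small_square = match_grid(448, 448)       # stays at (1, 1)
```

With a 12-tile budget the enumeration yields the 35 candidates mentioned above, and a 448 $\times$ 448 input stays on a single tile instead of being enlarged to a 2:2 grid.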

Image Division & Thumbnail. Once an appropriate aspect ratio is determined, the image is resized to the corresponding resolution. For example, an 800 $\times$ 1300 image will be resized to 896 $\times$ 1344. The resized image is then divided into tiles of 448 $\times$ 448 pixels. Alongside the tiles, we include a thumbnail of the entire image to capture the global context. This thumbnail is scaled down to 448 $\times$ 448, aiding the model in understanding the overall scene. Therefore, during training, the number of visual tokens ranges from 256 to 3,328. During testing, the number of tiles can increase to a maximum of 40, resulting in 10,496 visual tokens.
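The numbers in this paragraph can be reproduced with a short sketch. The helper names are hypothetical, and one detail is inferred rather than stated: since the minimum is 256 tokens, the sketch assumes that a single-tile image is not given an extra thumbnail.

```python
TILE = 448             # tile side length in pixels
TOKENS_PER_TILE = 256  # visual tokens per 448 x 448 view, as stated in the text


def resized_size(cols, rows):
    """Target (width, height) for a grid of cols x rows tiles."""
    return cols * TILE, rows * TILE


def tile_boxes(cols, rows):
    """Crop boxes (left, top, right, bottom) in row-major order."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


def visual_tokens(num_tiles):
    """Total visual tokens, adding one thumbnail when there is more than one tile.

    The "no thumbnail for a single tile" rule is an inference from the
    stated minimum of 256 tokens.
    """
    thumbnail = 1 if num_tiles > 1 else 0
    return (num_tiles + thumbnail) * TOKENS_PER_TILE


size = resized_size(2, 3)   # the 800 x 1300 example maps to a 2 x 3 grid
boxes = tile_boxes(2, 3)    # six 448 x 448 crops
```

For 12 tiles this gives (12 + 1) $\times$ 256 = 3,328 tokens, and for 40 tiles (40 + 1) $\times$ 256 = 10,496 tokens, in agreement with the figures above.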

3.4 High-Quality Bilingual Dataset

Pre-training Dataset. The pre-training dataset utilized in our InternVL 1.5 encompasses a diverse range of publicly accessible sources. We provide an overview of these datasets in Table 1(a). These datasets span multiple tasks, including captioning, which predominantly uses datasets such as Laion-EN , Laion-ZH , COYO , and GRIT , constituting 53.9% of the total data. Detection and grounding tasks utilize datasets like Objects365 , GRIT , and All-Seeing , making up 5.2%. For OCR tasks, we utilized large-scale datasets such as Wukong-OCR, LaionCOCO-OCR, and Common Crawl PDFs, which constitute 32.0% of our data. These datasets were constructed using PaddleOCR to perform OCR on Chinese images from Wukong and on English images from LaionCOCO . Smaller OCR datasets include MMC-Inst , LSVT , ST-VQA , RCTW-17 , ArT , and others, accounting for 8.9% of the data, which focus on more specific or constrained OCR challenges. This diverse dataset assembly ensures robust model pre-training of InternVL, catering to varied linguistic and visual elements across tasks.

Fine-tuning Dataset. During the fine-tuning stage, we meticulously selected datasets to enhance model performance across a wide range of multimodal tasks. The datasets used in this phase are summarized in Table 1(b).

For image captioning, we included TextCaps and bilingual ShareGPT4V , which help the model learn to generate descriptive captions in both English and Chinese. In the domain of general QA, datasets such as VQAv2 , GQA , and VisualDialog teach the model to handle diverse question-answering scenarios.

For scientific image understanding, datasets like AI2D , ScienceQA , and TQA provide content-rich scenarios to enhance the model’s ability to interpret scientific diagrams and texts. Chart interpretation is bolstered by ChartQA , MMC-Inst , and PlotQA , which train the model to analyze and understand chart images. Mathematics datasets such as GeoQA+ , TabMWP , and MathQA introduce complex numerical and geometric problem-solving tasks. Knowledge-based QA benefits from the inclusion of datasets like KVQA and bilingual Wikipedia , enabling the model to extract and reason with factual information across multiple languages.

For tasks involving OCR, we utilize OCRVQA, TextVQA, and several datasets focused on Chinese and English text recognition, such as SynthDoG, to improve text recognition from images. Document understanding is advanced through datasets like DocVQA and Common Crawl PDFs, which prepare the model for real-world document analysis. Visual grounding is trained using RefCOCO and Visual Genome, aiding the model in precise object localization within images. In the realm of multimodal conversation, datasets like LLaVA-150K and ALLaVA enhance the model’s dialogic capabilities by simulating interactive and engaging scenarios. Lastly, text-only datasets such as OpenHermes2.5 and Alpaca-GPT4, among others, are used to maintain the original linguistic capabilities of the LLM.

In summary, these datasets together establish a rich and diverse foundation for fine-tuning, which enhances our model’s ability to handle a wide range of multimodal tasks and ensures its readiness for practical applications.

Data Translation Pipeline. As shown in Figure 5, to enhance our model’s multilingual capabilities, we implemented a data translation pipeline. This pipeline utilizes state-of-the-art open-source LLMs or GPT-3.5 to convert English datasets to another language (e.g., Chinese), maintaining consistency and precision in bilingual labeling. Moreover, it can readily expand to encompass more languages by adjusting the language prompt, without relying on manual annotation processes.
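As a rough illustration of how adjusting the language prompt extends such a pipeline, consider the sketch below. Everything in it is hypothetical: the prompt wording, the sample fields (`image`, `question`, `answer`), and the `llm` callable, which stands in for whichever open-source LLM or API is used. It is not the prompt or code used for InternVL 1.5.

```python
from typing import Callable, Dict, List


def build_translation_prompt(text: str, target_language: str) -> str:
    """Hypothetical prompt; only the target language changes between runs."""
    return (
        f"Translate the following text into {target_language}. "
        "Keep numbers, names, and formatting unchanged, "
        "and output only the translation.\n\n"
        f"{text}"
    )


def translate_dataset(
    samples: List[Dict[str, str]],
    target_language: str,
    llm: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Translate the text fields of each sample, leaving the image untouched."""
    translated = []
    for sample in samples:
        new_sample = dict(sample)
        for field in ("question", "answer"):  # assumed text fields
            prompt = build_translation_prompt(sample[field], target_language)
            new_sample[field] = llm(prompt)
        new_sample["language"] = target_language
        translated.append(new_sample)
    return translated
```

Switching the `target_language` argument is the only change needed to target another language, which is the property the pipeline description emphasizes.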

In Table 1, we have annotated the language for each dataset. For a dataset that was originally in English, an annotation as “zh” indicates that we have translated it into Chinese using the translation pipeline. For example, COYO and GRIT were originally English datasets, and we have translated them into Chinese. By leveraging this translation pipeline, the Chinese capabilities of InternVL 1.5 have been greatly enhanced.

Experiments

InternVL 1.5 was developed by integrating the InternViT-6B vision encoder with the InternLM2-20B language model, using a dynamic high-resolution strategy. In this approach, images are segmented into 448 $\times$ 448 pixel tiles, with the number of tiles ranging up to 12 based on the image’s aspect ratio and resolution during training. In testing phases, the model could handle up to 40 tiles, equivalent to 4K resolution, demonstrating its adaptability to high-resolution inputs in a zero-shot manner. Notably, we built our model based on the chat version of InternLM2-20B rather than the base model.

The training of InternVL 1.5 was divided into two stages. Initially, the pre-training stage focused on training the InternViT-6B vision encoder and the MLP projector to optimize visual feature extraction. Subsequently, the entire model’s 26 billion parameters were fine-tuned to enhance multimodal capabilities. In both stages of training, we use a context length of 4096 and adopt the same response formatting prompts as LLaVA 1.5. Additionally, the evaluation was mainly supported by VLMEvalKit.
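The two-stage schedule can be summarized as a small configuration sketch. The module names and dictionary layout are hypothetical, the parameter counts are nominal values rounded from the model names (the MLP projector is ignored), and treating the LLM as frozen during pre-training is an inference from the description rather than an explicit statement.

```python
CONTEXT_LENGTH = 4096  # used in both stages, as stated in the text

# Nominal sizes in billions of parameters, rounded from the model names.
NOMINAL_PARAMS_B = {"vision_encoder": 6, "mlp_projector": 0, "llm": 20}

# Which modules receive gradient updates in each stage.
STAGES = {
    # Stage 1: vision encoder and projector are trained; LLM assumed frozen.
    "pretrain": {"vision_encoder": True, "mlp_projector": True, "llm": False},
    # Stage 2: the entire model is fine-tuned.
    "finetune": {"vision_encoder": True, "mlp_projector": True, "llm": True},
}


def trainable_modules(stage):
    """Names of the modules that are updated in the given stage."""
    return sorted(name for name, on in STAGES[stage].items() if on)


def nominal_trainable_params_b(stage):
    """Approximate trainable parameters (in billions) for the given stage."""
    return sum(NOMINAL_PARAMS_B[name] for name in trainable_modules(stage))
```

With these nominal sizes, the fine-tuning stage covers roughly 6 + 20 = 26 billion parameters, consistent with the figure quoted above.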

4.2 Comparison with State-of-the-Art MLLMs

In this section, we conduct an extensive evaluation across a series of benchmarks to assess our model’s multimodal understanding and reasoning capability. The benchmarks employed in our study are categorized into four distinct types: OCR-related, general multimodal, mathematical, and multi-turn conversation benchmarks. As depicted in Table 2, InternVL 1.5 exhibits leading performance across the majority of these benchmarks.

OCR-related Image Understanding. We evaluate the model performance across four key dimensions of OCR: document comprehension (DocVQA ), chart understanding (ChartQA ), infographic understanding (InfographicVQA ), and scene text interpretation (TextVQA ). Additionally, we employ OCRBench to perform a comprehensive evaluation of the model’s overall OCR capabilities. As shown in Table 2, our model demonstrated comparable performance to proprietary models on these benchmarks and significantly outperformed the open-source LLaVA-NeXT as well as InternVL 1.2, the predecessor of InternVL 1.5. Notably, our model achieves state-of-the-art performance on ChartQA and OCRBench, outperforming all competing proprietary models.

General Multimodal Evaluation. In addition to OCR-related benchmarks, we tested our model on several general multi-modal benchmarks. We used RealWorldQA to evaluate the model’s real-world spatial understanding capabilities. HallusionBench was employed to assess its ability to control hallucinations. Additionally, MMMU was utilized to evaluate the model’s multidisciplinary capabilities, and AI2D to assess its understanding of science diagrams. We also tested the model’s proficiency in Chinese and understanding of Chinese culture with the MMBench-CN test and CCBench , respectively. Other comprehensive benchmarks such as MME , MMBench-EN , MMVet , SEED , and MMT-Bench were also used to assess the model’s visual understanding and reasoning abilities.

Compared to other open-source models like Text-Monkey , DocOwl-1.5 , and LLaVA-NeXT , our InternVL 1.5 significantly closes the gap with proprietary models in these benchmarks. Specifically, our model achieves the best performance on HallusionBench , demonstrating its outstanding ability to reduce hallucinations. Moreover, thanks to our high-quality bilingual dataset, our model exhibits robust Chinese language capabilities, significantly surpassing both open-source and proprietary methods on MMBench-CN and CCBench. However, while InternVL 1.5 surpasses MM1 and is comparable to Gemini Pro 1.0 on MMMU, it shows a slight decline from its predecessor, InternVL 1.2. We attribute this modest decrement to the smaller size of the language model, a phenomenon similarly observed in the MMT-Bench results, as shown in Table 3.

Math Reasoning. MathVista is a benchmark designed to integrate challenges from various mathematical and visual tasks. Completing these tasks requires a deep understanding of visuals, logical thinking, and math knowledge—areas where many proprietary commercial models encounter significant difficulties. As shown in Table 2, our model outperforms others, including GPT-4V , by a clear margin in this benchmark, showcasing its ability to handle mathematically demanding tasks.

Multi-Turn Conversation. Compared to single-turn dialogues, multi-turn conversations align more with human preferences. In practical usage, multi-turn dialogue is the preferred mode for general-purpose assistants to engage with humans in solving a variety of tasks. Therefore, we opt to utilize ConvBench for evaluating multi-turn conversations, which progressively assesses the perception, reasoning, and creativity capabilities of MLLMs. As depicted in Table 3, InternVL exhibits leading performance among open-source models, albeit still trailing behind GPT-4V by a considerable margin. Going forward, we will continue refining InternVL’s capabilities in multi-turn conversations.

4.3 Ablation Study

Larger LLMs need Larger VFMs. In this study, we investigate the interplay between LLMs and VFMs. The comparison involves two open-source MLLMs, LLaVA-NeXT and InternVL 1.2, each equipped with LLMs of 34 billion parameters. Notably, although both models employ LLMs of the same scale, InternVL 1.2 incorporates a significantly larger VFM, with 6 billion parameters, compared to LLaVA-NeXT’s 300 million parameters. Since the data for LLaVA-NeXT is not available, we created a similar dataset ourselves. Additionally, InternVL 1.2 was trained at a fixed resolution of 448 $\times$ 448, while LLaVA-NeXT used a higher dynamic resolution of $672\times 672$ . Therefore, this comparison is not entirely fair or equivalent. Nevertheless, the findings still reveal noteworthy insights. For example, after excluding five OCR-related datasets, ConvBench, and RealWorldQA, InternVL 1.2 outperformed LLaVA-NeXT in 9 out of the remaining 11 datasets. This performance difference supports our hypothesis that for a large-scale LLM (e.g., 34B), a larger VFM (e.g., 6B) can effectively improve the model’s ability to handle complex multimodal tasks, thereby enhancing the overall performance.

Dynamic Resolution Matters. As shown in Figure 6, we investigated the effectiveness of dynamic resolution across various multimodal benchmarks. We found that not all tasks require high resolution. Specifically, tasks related to OCR, such as DocVQA, InfoVQA, TextVQA, and OCRBench, benefit from increased resolution. However, tasks like AI2D, MMMU, MMBench, and HallusionBench exhibit a slight decline in performance at higher resolutions. Overall, InternVL 1.5 demonstrates strong robustness to dynamic resolution. It can adjust the resolution based on the specific requirements of each task, ensuring optimal performance where high resolution is beneficial and conserving resources where it is not.

In previous sections, we evaluated our model across various benchmarks and observed its strong performance. In this section, we conduct a qualitative comparison of our model with GPT-4V across diverse scenarios, including General QA, OCR-related QA, Scientific Understanding, Chinese Traditional Culture, Object Localization, and Multi-Image Dialogue. We aim to demonstrate the practicality and versatility of our model in real-world applications, offering insights from the perspective of actual user experience.

General QA. To compare the general capabilities of InternVL 1.5 and GPT-4V, we first conducted an experiment involving simple user queries with images requiring general knowledge. As shown on the left side of Figure 7, both models respond accurately to the query, showcasing their proficiency in general topics. As shown on the right side of Figure 7, GPT-4V may be overly cautious, refusing to answer some questions because they touch on personal privacy.

OCR-Related QA. We conducted an evaluation to compare the OCR capabilities of our InternVL 1.5 model against GPT-4V. On the left side of Figure 8, the first prompt aimed to measure the models’ ability to understand Chinese scenes. In this instance, GPT-4V fails to extract all of the useful information in the image. On the right side of Figure 8, both GPT-4V and our model perform well on chart understanding.

Scientific Understanding. Evaluating the capabilities of models in scientific understanding and reasoning tasks is essential for advancing computational intelligence, particularly in contexts requiring in-domain knowledge and logical reasoning. In our study, we compared the performance of our InternVL 1.5 model with GPT-4V by administering complex multi-disciplinary problems designed to assess the accuracy of their reasoning. In Figure 9, for the first question, both models answered accurately and provided an analysis from an aerodynamic perspective. For the second question, our model precisely analyzed the elements depicted in the image and provided the correct response, whereas GPT-4V speculated on the trend of amino acid transport. These results suggest that our method and GPT-4V exhibit comparable capabilities in scientific understanding and reasoning.

Chinese Traditional Culture. We selected two typical multimodal examples related to traditional Chinese art to evaluate our model. As illustrated in Figure 10, both InternVL 1.5 and GPT-4V correctly recognize the Chinese traditional culture depicted in the image. Notably, InternVL 1.5 demonstrates a deeper understanding of this culture, as evidenced by its more detailed descriptions of the cultural elements in its response.

Object Localization. Evaluating machine learning models for their proficiency in object localization tasks is essential, especially in applications requiring precise spatial awareness. In our comparative analysis, the performance of the InternVL 1.5 model was juxtaposed with GPT-4V, focusing on their ability to accurately detect and localize objects within various environments. Our assessments ranged from simple object recognition in cluttered scenes to complex scenarios involving dynamic interactions among multiple entities. As illustrated in Figure 11, the results demonstrate that InternVL 1.5 not only localized objects with high accuracy but also exhibited a comparable understanding of spatial relationships, matching the performance of GPT-4V.

Multi-Image Dialogue. As shown in Figure 12, in this experiment, we ask InternVL 1.5 and GPT-4V to compare the similarities and differences between the two images. As can be seen, both GPT-4V and InternVL 1.5 provide detailed and accurate responses. Through this experiment, we discovered that although InternVL 1.5 was trained solely on single-image inputs, it exhibits strong zero-shot capabilities for multi-image dialogues.

Conclusion

This work introduced InternVL 1.5, an open-source MLLM designed to narrow the performance gap between open-source and proprietary models in multimodal understanding. By integrating a strong vision encoder with continuous learning capabilities, adopting a dynamic high-resolution strategy, and utilizing a high-quality bilingual dataset, InternVL 1.5 has demonstrated robust performance across a variety of benchmarks. Our evaluations indicate that the model achieves competitive performance with leading proprietary models, excelling particularly in OCR-related tasks and showing significant improvements in Chinese-related scene understanding. While InternVL 1.5 has contributed to open-source multimodal understanding, the field continues to evolve with many challenges ahead. We aspire to further enhance InternVL’s capabilities and invite collaboration with the global research community, hoping to enrich and expand the reach of open-source models together.