KOSMOS-2.5: A Multimodal Literate Model

Tengchao Lv, Yupan Huang, Jingye Chen, Yuzhong Zhao, Yilin Jia, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei

cs.CL cs.CV

Introduction

Over the past several years, large language models (LLMs) have emerged as a critical area of research in artificial intelligence. These models are designed to learn from massive amounts of natural language data, allowing them to perform a wide range of language-related tasks with impressive accuracy. This development has been fueled by advancements in model scaling that enabled researchers to create models with unprecedented complexity. As a result, LLMs have become increasingly prevalent across various industries and applications, from customer service chatbots to virtual assistants and automated content creation. One notable trend in recent years has been the focus on building larger and more complex models, such as GPT-3 and GPT-4, which have hundreds to thousands of billions of parameters and can generate compelling language outputs. While these models require significant computing resources to train and operate, they hold enormous potential for revolutionizing how we interact with and understand natural language.

Current LLMs primarily focus on textual information and cannot understand visual information. However, advancements in the field of multimodal large language models (MLLMs) aim to address this limitation. MLLMs combine visual and textual information within a single Transformer-based model, enabling the model to learn and generate content based on both modalities. MLLMs have shown promise in a variety of real-world applications, including natural image understanding and text image understanding. These models leverage the power of language modeling as a general interface for multimodal problems, allowing them to process and generate responses based on textual and visual inputs. While existing MLLMs have mainly focused on natural images with lower resolutions, the exploration of text images is an area that requires further investigation. Taking advantage of large-scale multimodal pre-training for text images is an important direction for MLLM research. By incorporating text images into the training process and developing models based on textual and visual information, we can unlock new possibilities for multimodal applications involving high-resolution text-intensive images.

In this study, we present Kosmos-2.5, a multimodal literate model that builds on Kosmos-2 and is designed to tackle machine reading of text-intensive images, as shown in Figure 1. Kosmos-2.5 performs two closely related transcription tasks in a unified multimodal model. The first task generates spatially-aware text blocks, assigning text lines their corresponding spatial coordinates within the original text-rich image. The second task produces structured text output, capturing styles and structures in the markdown format. Both tasks are conducted under a unified framework, leveraging a shared Transformer architecture, task-specific prompts, and flexible text representations. Specifically, our model architecture combines a ViT-based vision encoder and a Transformer-based language decoder linked by a resampler module. Our model is pre-trained on a large corpus of text-intensive images, whose text representations include text lines with bounding boxes and plain markdown texts. By employing this dual-task training strategy, Kosmos-2.5 enhances its general-purpose multimodal literate capabilities. We assess the performance of Kosmos-2.5 on two tasks: end-to-end document-level text recognition and markdown-formatted image-to-text generation. Experimental results demonstrate strong literate performance on several text-intensive image understanding tasks. In addition, Kosmos-2.5 demonstrates promising capabilities in few-shot and zero-shot learning scenarios, offering a universal interface for real-world applications that involve text-rich images.

The contributions of this work are summarized as follows:

Kosmos-2.5 represents a significant paradigm shift in text image understanding, transitioning from encoder-only/encoder-decoder models to a decoder-only model. It is pre-trained by incorporating dual transcription tasks (spatially-aware text block generation and structured markdown text generation) into a single, unified model architecture.

This innovative method streamlines the application interface by integrating generative multimodal language modeling, simplifying the traditionally complex cascaded pipelines used for various downstream tasks.

Furthermore, Kosmos-2.5 demonstrates impressive multimodal literate capabilities, thus setting the stage for future scaling of multimodal large language models.

Kosmos-2.5

The model architecture of Kosmos-2.5 consists of a pre-trained vision encoder and a language decoder connected by a resampler module, as shown in Figure 2. We adopt a pre-trained vision encoder based on the Vision Transformer (ViT). We further adapt a Perceiver Resampler module with an attentive pooling mechanism to reduce the size of the image embeddings. The language decoder is a Transformer-based decoder that conditions on image and text context for next-token prediction.

2 Image and Text Representations

Kosmos-2.5 takes a composite input consisting of an image and a text representation. The image representation is uniform across various configurations and leverages a variable-resolution input strategy following Pix2Struct. Precisely, we extract the maximum number of fixed-size patches ($16\times 16$) that can fit within a predefined sequence length $L$. In addition, the resampler is used as an attentive pooling mechanism to reduce the number of image embeddings. The text representation, however, is more versatile and can be one of two types: text lines with bounding boxes or plain markdown texts.
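The patch-budget rule above can be sketched as follows. This is a minimal illustration modeled on the publicly described Pix2Struct resizing recipe, not the authors' code; the function name `patch_grid` and the parameter `max_patches` (standing in for the sequence length $L$) are assumptions introduced here.

```python
import math


def patch_grid(height: int, width: int, max_patches: int, patch: int = 16):
    """Return (rows, cols, resized_h, resized_w) for a variable-resolution input.

    The image is rescaled so that the grid of patch x patch tiles is as large
    as possible while rows * cols never exceeds max_patches.
    """
    # Scale factor that would make rows * cols equal max_patches before flooring.
    scale = math.sqrt(max_patches * (patch / height) * (patch / width))
    rows = max(min(math.floor(scale * height / patch), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch), max_patches), 1)
    return rows, cols, rows * patch, cols * patch


if __name__ == "__main__":
    rows, cols, new_h, new_w = patch_grid(1000, 500, max_patches=4096)
    print(rows, cols, rows * cols, new_h, new_w)
```

Because both grid dimensions are rounded down, the product `rows * cols` stays within the budget, and tall or wide pages keep roughly their original aspect ratio instead of being squashed to a square.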

Text lines with bounding boxes: For the layout-based document representation, text lines and their associated bounding boxes are extracted. Inspired by Kosmos-2, we ground the text lines to their spatial positions in images by aligning their representations. The coordinates of these bounding boxes are then converted into discrete location tokens. Given that $L$ also represents the maximum length for each image dimension, we introduce a set of $2L+2$ specialized tokens. These tokens, $\texttt{<}x_{0}\texttt{>}$, $\texttt{<}x_{1}\texttt{>}$, …, $\texttt{<}x_{L-1}\texttt{>}$, $\texttt{<}y_{0}\texttt{>}$, …, $\texttt{<}y_{L-1}\texttt{>}$, $\texttt{<bbox>}$, and $\texttt{</bbox>}$, correspond to the coordinates and the start and end of a bounding box. The coordinates are obtained by rounding down the actual position after resizing images. Consider a document $T$ that comprises $N$ text lines. Each line is represented as $\mathbf{T}_{n}=\{w_{1}^{(n)},w_{2}^{(n)},\ldots,w_{M_{n}}^{(n)}\}$, where $M_{n}$ is the number of words in the $n$-th text line. The bounding box for $\mathbf{T}_{n}$ is then denoted by $\mathbf{B}_{n}=\texttt{<bbox><}x_{\text{tl}}^{(n)}\texttt{><}y_{\text{tl}}^{(n)}\texttt{><}x_{\text{br}}^{(n)}\texttt{><}y_{\text{br}}^{(n)}\texttt{></bbox>}$, which includes coordinates for its top-left and bottom-right corners.
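As a rough illustration of the discretization described above, the sketch below maps a box in original-image pixels to location tokens. The token spellings (`<x_12>`, `<y_7>`), the helper names, and the clamping to $[0, L-1]$ are assumptions made for this example; only the rounding-down step and the $2L+2$ vocabulary size come from the text.

```python
import math


def location_vocab(L: int):
    """Build the 2L + 2 location tokens: L x-tokens, L y-tokens, and two box markers."""
    xs = [f"<x_{i}>" for i in range(L)]
    ys = [f"<y_{i}>" for i in range(L)]
    return xs + ys + ["<bbox>", "</bbox>"]


def bbox_to_tokens(box, scale_x: float, scale_y: float, L: int) -> str:
    """Convert (x_tl, y_tl, x_br, y_br) in original pixels to a token string.

    scale_x and scale_y are the resize factors applied to the image; positions
    are rounded down after resizing. Clamping to L - 1 is an assumption.
    """
    def quantize(value: float, scale: float) -> int:
        return min(max(math.floor(value * scale), 0), L - 1)

    x_tl, y_tl, x_br, y_br = box
    return (
        f"<bbox><x_{quantize(x_tl, scale_x)}><y_{quantize(y_tl, scale_y)}>"
        f"<x_{quantize(x_br, scale_x)}><y_{quantize(y_br, scale_y)}></bbox>"
    )


if __name__ == "__main__":
    print(len(location_vocab(1024)))
    print(bbox_to_tokens((10.7, 20.2, 110.9, 40.0), 0.5, 0.5, L=1024))
```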

Markdown texts: For the markup-based document representation where the output text is in the markdown format, the text component captures both content and formatting markup. Unlike layout-based documents, markdown text does not require bounding boxes. Instead, the text is directly tokenized, retaining all special characters and formatting indicators.

To facilitate these diverse input types, we employ different composite representations. For image-text pairs with text lines and bounding boxes, the input is denoted as $\texttt{<s><image>}\,\text{Image Embedding}\,\texttt{</image>}\bigcup_{n=1}^{N}(\mathbf{B}_{n}\oplus\mathbf{T}_{n})\,\texttt{</s>}$. The operator $\oplus$ represents the concatenation of the text line $\mathbf{T}_{n}$ and its bounding box $\mathbf{B}_{n}$. Conversely, when the text is in the markdown format, the input simplifies to $\texttt{<s><image>}\,\text{Image Embedding}\,\texttt{</image>}\,\text{Markdown Text}\,\texttt{</s>}$. In both cases, $\texttt{<s>}$ and $\texttt{</s>}$ signify the sequence boundaries, while $\texttt{<image>}$ and $\texttt{</image>}$ indicate the beginning and end of image embeddings. This flexibility in text representation allows Kosmos-2.5 to apply to various document analysis tasks.
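The two composite inputs can be sketched as plain string templates. This is only an illustration: it assumes the boundary tokens are spelled `<s>`, `</s>`, `<image>`, and `</image>`, uses a hypothetical `IMAGE_SLOT` placeholder where the real model would splice in image embeddings, and concatenates lines without any separator.

```python
IMAGE_SLOT = "[image embeddings]"  # hypothetical placeholder for the embedded image


def layout_sequence(lines) -> str:
    """Compose the layout input from (bbox_tokens, text) pairs.

    Each bounding box B_n is placed immediately before its text line T_n.
    """
    body = "".join(bbox + text for bbox, text in lines)
    return f"<s><image>{IMAGE_SLOT}</image>{body}</s>"


def markdown_sequence(markdown_text: str) -> str:
    """Compose the markup input: the image followed by raw markdown text."""
    return f"<s><image>{IMAGE_SLOT}</image>{markdown_text}</s>"


if __name__ == "__main__":
    print(layout_sequence([("<bbox><x_1><y_2><x_3><y_4></bbox>", "Invoice")]))
    print(markdown_sequence("# Invoice\n\n**Total:** 42"))
```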

3 Pre-training Data

The pre-training process enables Kosmos-2.5 to learn versatile representations suitable for various text-intensive image understanding tasks. The model is pre-trained on a rich array of datasets from diverse sources. Traditional Optical Character Recognition (OCR) tasks are primarily geared towards generating text content and its 2D positions within an image. However, they often neglect the need to maintain the order and structural integrity of the original document, which is essential for text-intensive image understanding tasks involving structured information.

To address this, we steer Kosmos-2.5 to excel in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. Markdown provides an advantage over plain text by explicitly distinguishing different structural elements, such as tables and lists, with specific tokens. For example, table cells can be denoted with vertical bars (|) and list items with bullets (*, -, or +). It also standardizes the representation of typographic emphases like bold (**bold**) and italics (*italics*), integrating the learning of document structure with natural language understanding in a unified model.

IIT-CDIP: The IIT-CDIP dataset is a large-scale public collection comprising scanned document images. We used approximately 27.6 million pages to train our model.

arXiv papers: arXiv, an open-access research-sharing platform, provides another significant data source, accounting for roughly 20.9 million pages. We downloaded bulk data, consisting of PDF and LaTeX source files, from the official arXiv repository (https://info.arxiv.org/help/bulk_data/index.html).

PowerPoint slides: A corpus of 6.2 million pages is collected from various web pages containing PowerPoint documents, significantly enhancing the diversity of our training data.

General PDF: Additionally, we crawled the web for diverse open-domain digital PDF files, leading to the collection of a large corpus comprising approximately 155.2 million pages.

Web screenshots: A subset of the mC4 webpages is scraped and rendered as screenshots containing almost 100 million pages.

For structured text output in markdown format, we use:

README: We collect 2.9 million “README.md” files from open-source GitHub projects, primarily written in markdown format.

DOCX: We also extract 1.1 million DOCX pages from millions of WORD files crawled from the web. The DOCX pages are converted to markdown format, and each page corresponds to its markdown information.

LaTeX: A subset of the arXiv papers is used to extract the mapping between PDF pages and their corresponding markdown, converted from the LaTeX code, for a total of 3.7 million pages.

HTML: We obtain 6.3 million HTML files from the aforementioned mC4 subset and convert them into markdown format.

4 Data Processing

The pre-training data has a wide coverage, and each type of data requires a different processing workflow, which is introduced as follows:

The IIT-CDIP dataset mainly consists of scanned document images. We use the Microsoft Read API (https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr#read-api) to extract text and layout information.

arXiv papers, PowerPoint slides, General PDF

We first compile and convert arXiv papers and PowerPoint slides into PDF files. We then apply the PyMuPDF parser (https://github.com/pymupdf/PyMuPDF) to these files and to the other general PDFs to extract text and layout information efficiently.
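A minimal sketch of this extraction step, assuming PyMuPDF's `page.get_text("dict")` output (blocks containing lines, each with a `bbox` and a list of `spans`). The helper names are hypothetical, and the authors' actual parsing options are not stated in the text.

```python
def lines_from_page_dict(page_dict):
    """Flatten a PyMuPDF page dictionary into (text, bbox) pairs, one per text line."""
    results = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 marks a text block, 1 an image block
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                results.append((text, tuple(line["bbox"])))
    return results


def extract_pdf_lines(path):
    """Return one list of (text, bbox) pairs per page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    with fitz.open(path) as doc:
        return [lines_from_page_dict(page.get_text("dict")) for page in doc]


if __name__ == "__main__":
    fake_page = {
        "blocks": [
            {"type": 0, "lines": [
                {"bbox": (72.0, 80.0, 300.0, 94.0),
                 "spans": [{"text": "Hello "}, {"text": "world"}]},
            ]},
            {"type": 1},  # image block, skipped
        ]
    }
    print(lines_from_page_dict(fake_page))
```

Keeping the flattening logic separate from file access makes it easy to check on a hand-built dictionary before running it over a large crawl.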

Web screenshots

We also include webpage screenshots in the model pre-training to diversify the layout distribution further. We collect the webpage URLs from the English portion of the mC4 dataset. Playwright (https://github.com/microsoft/playwright-python) is used to access a specified URL and open the webpage. The HTML content of the page is extracted and parsed using the lxml library (https://lxml.de/) to obtain a Document Object Model (DOM) tree representation. This DOM tree is traversed, examining the XPath of each element within it. This traversal aims to determine whether each element is visible and retrieve information about its bounding boxes.
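The traversal described above might look roughly like the sketch below, which uses Playwright's synchronous API and lxml. It is an illustration under assumptions: the function names are hypothetical, zero-area boxes are skipped by choice, and the authors' visibility rules and screenshot settings are not given in the text.

```python
def box_to_xyxy(box):
    """Convert Playwright's {x, y, width, height} dict to (x0, y0, x1, y1)."""
    return (box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"])


def visible_element_boxes(url):
    """Return (xpath, (x0, y0, x1, y1)) for every visible element on the page."""
    from lxml import html
    from playwright.sync_api import sync_playwright

    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        root = html.fromstring(page.content())
        tree = root.getroottree()
        for element in root.iter():
            if not isinstance(element.tag, str):  # skip comments and similar nodes
                continue
            xpath = tree.getpath(element)
            locator = page.locator(f"xpath={xpath}")
            if locator.count() != 1 or not locator.is_visible():
                continue
            box = locator.bounding_box()
            if box and box["width"] > 0 and box["height"] > 0:
                results.append((xpath, box_to_xyxy(box)))
        browser.close()
    return results


if __name__ == "__main__":
    print(box_to_xyxy({"x": 10, "y": 20, "width": 30, "height": 5}))
```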

README (markdown)

In addition to layout-based data, we collect markup-based data for the pre-training. We collect “README.md” files from many GitHub projects and convert these files into HTML using Pandoc (https://pandoc.org/). Then, wkhtmltopdf (https://wkhtmltopdf.org/) is used to obtain the images from the generated HTML content.
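One way to script this README rendering is sketched below. The choice of GitHub-flavored markdown as the Pandoc input format, the use of the `wkhtmltoimage` binary that ships alongside wkhtmltopdf, and the helper names are all assumptions; the text only names the two tools.

```python
import subprocess
from pathlib import Path


def render_commands(readme: Path, out_dir: Path):
    """Build the two command lines: markdown -> standalone HTML -> PNG image."""
    html_path = out_dir / (readme.stem + ".html")
    png_path = out_dir / (readme.stem + ".png")
    commands = [
        # Pandoc: read GitHub-flavored markdown, write a standalone HTML file.
        ["pandoc", str(readme), "-f", "gfm", "-t", "html", "-s", "-o", str(html_path)],
        # wkhtmltoimage: rasterize the HTML file into an image.
        ["wkhtmltoimage", str(html_path), str(png_path)],
    ]
    return commands, png_path


def render_readme(readme: Path, out_dir: Path) -> Path:
    """Run both tools and return the path of the rendered image."""
    out_dir.mkdir(parents=True, exist_ok=True)
    commands, png_path = render_commands(readme, out_dir)
    for command in commands:
        subprocess.run(command, check=True)
    return png_path


if __name__ == "__main__":
    cmds, png = render_commands(Path("repo/README.md"), Path("out"))
    for cmd in cmds:
        print(" ".join(cmd))
    print(png)
```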

DOCX (markdown)

The Microsoft Office WORD files have been extensively used in existing research like TableBank and ReadingBank. We collect WORD DOCX files and convert them into texts with markdown. First, we use Pandoc to convert the XML content within the DOCX files into markdown files. As Pandoc keeps the “<table>” tags to represent the tabular cells in the generated markdown, we further identify all the tables and use markdownify (https://github.com/matthewwithanm/python-markdownify) to convert them into the markdown formats. Finally, the original DOCX files are converted into PDF files, and each page is aligned to the corresponding span of the markdown content based on a heuristic method.
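The table clean-up step can be sketched as a search-and-replace over the Pandoc output. This assumes the leftover tables appear as HTML `<table>` elements and are not nested; the function name and the injectable `convert` argument are introduced here for illustration, with markdownify's `markdownify()` function as the default converter.

```python
import re

# Non-greedy match of one HTML table; nested tables are not handled in this sketch.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def convert_tables(markdown_text: str, convert=None) -> str:
    """Replace each embedded HTML table with its markdown rendering."""
    if convert is None:
        # Imported lazily so the function can be exercised with a stub converter.
        from markdownify import markdownify as convert
    return TABLE_RE.sub(lambda match: convert(match.group(0)).strip(), markdown_text)


if __name__ == "__main__":
    sample = "Intro\n\n<table><tr><td>a</td><td>b</td></tr></table>\n\nEnd"
    # A stub converter keeps the demo independent of markdownify.
    print(convert_tables(sample, convert=lambda html: "| a | b |"))
```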

LaTeX (markdown)

LaTeX documents from arXiv have been used to generate PDF files to obtain texts with bounding boxes. Meanwhile, we also convert the LaTeX content into markdown texts. Similar to Nougat, LaTeXML (https://math.nist.gov/~BMiller/LaTeXML/) is used to convert the LaTeX code into an HTML sequence, which is further transformed into the markdown format. Unlike Nougat, we keep all the tables at the beginning of the page, as most LaTeX users prefer to position tables with “[t]” or “[h]” instead of “[b]”. Meanwhile, we also convert the table content from the LaTeX format into the markdown format.

HTML (markdown)

The most straightforward way to obtain markdown resources from HTML webpages is through web scraping. However, webpages are often cluttered with various layouts and styles, resulting from the misuse of HTML tags. Moreover, HTML pages may include extraneous elements, such as advertisements, navigation menus, or formatting elements, making extracting clean and meaningful content challenging. To overcome these obstacles, we employ Playwright, a fast and reliable end-to-end testing framework for the web. The library allows us to navigate the HTML structure, filter out non-essential elements, and extract the relevant text content. We also apply custom rules and regular expressions to further refine the extracted text and format it as markdown, ensuring that the resulting markdown files are coherent and readable.

5 Filtering and Quality Control

We employ fastText for language identification (with a threshold of 0.5) to filter out non-English documents from the entire pre-training dataset. To ensure content diversity within each source, we use MinHash to identify and remove redundant pages. We use the same parameters as prior work, and a document pair with a similarity of 0.8 is marked as a duplicate. A comprehensive breakdown of the pre-training data, along with the respective sampling ratios, is provided in Table 1. When dealing with image-to-markdown data from README, DOCX, LaTeX, and HTML sources, we observe discrepancies between the content in text images and their corresponding markdown sequences due to conversion issues. Consequently, we refine the data by evaluating token overlap between images and markdown files, requiring a token intersection-to-union ratio greater than 0.95 for inclusion. Section A.2 shows some of the training samples.
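The two overlap-based filters can be sketched in pure Python as follows. The tokenizer, the use of token sets rather than multisets, the 5-gram shingles, and the 128 hash functions are assumptions for illustration; only the 0.8 duplicate threshold and the 0.95 intersection-to-union threshold come from the text.

```python
import hashlib
import re


def tokens(text: str):
    """Lowercased word tokens; the real tokenizer is not specified in the text."""
    return re.findall(r"\w+", text.lower())


def token_iou(a: str, b: str) -> float:
    """Intersection-over-union of the two token sets."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def keep_pair(image_text: str, markdown_text: str, threshold: float = 0.95) -> bool:
    """Keep an image/markdown pair only if the token overlap exceeds the threshold."""
    return token_iou(image_text, markdown_text) > threshold


def minhash_signature(text: str, num_perm: int = 128, ngram: int = 5):
    """MinHash signature over word n-gram shingles, one salted hash per slot."""
    toks = tokens(text)
    shingles = {" ".join(toks[i:i + ngram]) for i in range(max(len(toks) - ngram + 1, 1))}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching slots, an estimate of the shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return estimated_jaccard(minhash_signature(a), minhash_signature(b)) >= threshold


if __name__ == "__main__":
    page = "the quick brown fox jumps over the lazy dog near the river bank"
    print(is_duplicate(page, page), token_iou("a b c d", "a b c"))
```

At corpus scale one would normally pair such signatures with locality-sensitive hashing so that not every pair of pages has to be compared directly.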

Experiments

We use word-level precision (# of correct matches over the number of detected words), recall (# of correct matches over the number of ground truth words), and F1 as the metrics to evaluate text recognition performance. If there are repeated words in the ground truth, they are expected to be repeated in the prediction. Text recognition is evaluated on three benchmark datasets: FUNSD, SROIE, and CORD. We compare Kosmos-2.5 to the text recognition results from Document OCR in Google Document AI (https://cloud.google.com/document-ai).
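A small sketch of these word-level metrics, assuming that matches are counted as a multiset intersection (so a word repeated in the ground truth must be repeated in the prediction to count each time) and that word order and position are ignored. The exact matching protocol used in the evaluation is not spelled out in the text.

```python
from collections import Counter


def word_prf(predicted_words, ground_truth_words):
    """Return (precision, recall, f1) using multiset matching of words."""
    predicted = Counter(predicted_words)
    truth = Counter(ground_truth_words)
    # Counter & Counter keeps the minimum count per word, i.e. the matched copies.
    correct = sum((predicted & truth).values())
    precision = correct / sum(predicted.values()) if predicted else 0.0
    recall = correct / sum(truth.values()) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f1 = word_prf("the cat the".split(), "the the the cat".split())
    print(round(p, 4), round(r, 4), round(f1, 4))
```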

Image-to-markdown Generation

In light of the unique nature of the image-to-markdown conversion task, assessing the quality of the generated markdown necessitates specialized metrics. We adopt a two-fold evaluation scheme: Normalized Edit Distance (NED) and Normalized Tree Edit Distance (NTED), considering both the lexical accuracy and the preservation of the original structural elements.
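For the lexical side, a minimal sketch of a normalized edit distance is given below. It assumes the common definition, character-level Levenshtein distance divided by the length of the longer string; the exact normalization used in the evaluation is not stated in this section. Under this convention, a score where higher is better would correspond to one minus the value returned here, which is also an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical strings."""
    if not prediction and not reference:
        return 0.0
    return levenshtein(prediction, reference) / max(len(prediction), len(reference))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(normalized_edit_distance("# Title", "# Title"))
```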

However, given the hierarchical structure inherent to markdown, relying solely on a string-based comparison metric like NED can be insufficient. Thus, we adopt NTED as an additional evaluation metric for structural differences. NTED is a tree edit distance normalized by the number of nodes in the tree, considering the structural discrepancies between parse trees. Specifically, the predicted markdown sequence is first transformed into an HTML tree. Then, the tree edit distance between the prediction and the ground truth is calculated using the ZSS algorithm. The NTED is formulated as
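A form consistent with the description above can be sketched as follows. The averaging over $N$ evaluation pairs and the choice of the larger of the two node counts as the normalizer are assumptions; the symbols $\mathrm{TD}$ (tree edit distance), $\mathrm{node}(\cdot)$ (number of nodes), $t_i$ (ground-truth tree), and $\hat{t}_i$ (predicted tree) are introduced here for illustration.

```latex
\mathrm{NTED} = \frac{1}{N}\sum_{i=1}^{N}
  \frac{\mathrm{TD}(t_i, \hat{t}_i)}
       {\max\bigl(\mathrm{node}(t_i),\ \mathrm{node}(\hat{t}_i)\bigr)}
```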

We create three datasets to evaluate the image-to-markdown task from different data sources, including document-level markdown generation, README markdown generation and table markdown generation. Each dataset includes 1,000 $\langle$ image, markdown $\rangle$ pairs, which are held out from the pre-training data. We compare Kosmos-2.5 to the markdown generated by the Nougat base and small models.

2 Implementation Details

We employ the AdamW optimizer with $\beta=(0.9,0.98)$ for optimization, setting the weight decay to 0.01 and the dropout rate to 0.1. The learning rate is warmed up to $2\times 10^{-4}$ during the initial 375 steps, followed by a linear decay to zero throughout the remaining training steps. The batch size is adjustable to align with the available computational resources and specific training requirements. Kosmos-2.5 contains a total of 1.3 billion parameters. The vision encoder is initialized from the encoder of the Pix2Struct-Large model. The language decoder includes 24 Transformer layers with a hidden size of 1,536, an FFN intermediate size of 6,144, and 16 attention heads. Section A.1 shows more details of the training hyperparameters.
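The learning-rate schedule above can be sketched as a plain function of the step index. The linear shape of the warm-up and the total of 250k steps (the sum of the three training stages described in this section) are assumptions; the peak of $2\times 10^{-4}$, the 375 warm-up steps, and the linear decay to zero come from the text.

```python
def learning_rate(step: int, total_steps: int, peak: float = 2e-4, warmup: int = 375) -> float:
    """Linear warm-up to `peak`, then linear decay to zero at `total_steps`."""
    if step < warmup:
        return peak * step / warmup
    remaining = max(total_steps - warmup, 1)
    return peak * (max(total_steps - step, 0) / remaining)


if __name__ == "__main__":
    total = 250_000  # assumed: 100k + 140k + 10k steps
    for step in (0, 375, 125_000, 250_000):
        print(step, learning_rate(step, total))
```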

Because far more layout-based data is available than markup-based data, we initially trained the model for 100k steps exclusively on the layout-based dataset. Subsequently, the two datasets were combined for a further 140k steps of training. Additionally, we incorporate the training split of the evaluation dataset into the entire pre-training data, extending the process by an additional 10k steps. For text tokenization, we use SentencePiece and adopt the “full-sentence” format. This approach packs each input sequence with full sentences, continuously sampled from one or multiple documents. Newly added word embeddings of location tokens are randomly initialized, with all parameters updated during training. We also apply the data augmentation approaches from TrOCR during training to make the model more robust.

Throughout the evaluation process, model inference is conducted using a single model checkpoint across various evaluation datasets with the corresponding task prompt respectively, demonstrating that our approach does not necessitate individualized model fine-tuning for each dataset.

3 Results

Kosmos-2.5 is a flexible framework that supports multitasking, with the task determined by the provided task prompt. Experimental results are reported in Table 2 and Table 3. For the text recognition task, Kosmos-2.5 outperforms Google Document OCR by 0.33%, 2.45%, and 1.35% in F1 score on the three benchmarks. For the image-to-markdown task, our method significantly outperforms Nougat. For example, Kosmos-2.5 improves over $\text{Nougat}_{\text{\,BASE}}$ by 33.68 percentage points (95.09% vs 61.41%) in terms of NED on the README dataset, and by 33.38 percentage points (82.08% vs 48.70%) in terms of NTED on the Documents dataset. We attribute the performance boost to the increased diversity of our training data compared to Nougat, which primarily focuses on the academic paper domain. This greater diversity enhances our model’s comprehension of different document types and strengthens its generalization capabilities. In summary, the experimental results validate the capabilities of Kosmos-2.5 across these tasks.

4 Discussion

We illustrate an example in Figure 3, showcasing the model outputs produced by Kosmos-2.5 with various task prompts when presented with the same input text image. As shown in the figure, the model generates distinct outputs depending on the task prompts it receives. When given the layout task prompt, the model produces a text sequence that includes the textual content and the corresponding bounding boxes.

With the markup task prompt, the model generates another text sequence that follows the markdown format.

It is apparent that Kosmos-2.5 excels in precisely identifying text positions and recognizing text content. Moreover, it adeptly captures the styles and structures present within the text image, including elements like titles, bullet points, tables, and bold text. Section A.3 provides the full output sequence using different task prompts for this example.

Kosmos-2.5 provides a unified architecture and interface for text image understanding, making it versatile for various application scenarios. Firstly, it can be fine-tuned as a single model for a wide range of text image understanding tasks, including information extraction, layout detection and analysis, visual question answering, screenshot understanding, UI automation, and many others. This unified model interface significantly streamlines downstream task training and enables the model to effectively follow instructions in real-world applications. Secondly, our solution is compatible with more powerful LLMs like GPT-3.5 or GPT-4. The output from our model can serve as contexts for LLMs, enhancing their capabilities through further prompt engineering. This approach empowers LLMs with robust text image understanding capabilities. Thirdly, we have the potential to augment the pre-training with textual data, transforming it into a general-purpose MLLM. This expanded model not only processes visual signals but also possesses strong language understanding capabilities.

Related Work

The rapid rise of large language models (LLMs), represented by ChatGPT, has revolutionized artificial intelligence and significantly impacted numerous downstream tasks such as text translation, code generation, and question answering. Despite this rapid development, it is important to recognize that human perception of the world is not limited to language alone but encompasses a wide range of modalities, with particular emphasis on the visual modality. Many research works attempt to “bring eyes” to LLMs and develop multimodal large language models (MLLMs), which can be categorized into LLM-centric scheduling systems and end-to-end trainable multimodal systems.

The LLM-centric scheduling system takes advantage of many vision foundation models (e.g., Stable Diffusion, ControlNet, BLIP, etc.), and schedules these models in a language-centric manner. For example, Visual ChatGPT develops a set of prompts to incorporate visual information into ChatGPT, enabling users to draw or edit images through chatting. MM-REACT leverages vision experts to augment its multimodal capabilities by incorporating a textual prompt design that can effectively represent various visual signals, including text descriptions, coordinates, and aligned file names, for images and videos. HuggingGPT connects LLMs with extensive AI models in machine learning communities, tackling user requests through ChatGPT’s task planning, model selection, and response summarization capabilities. Further, TaskMatrix.AI largely extends the scale and connects foundation models with millions of APIs for solving tasks in both digital and physical domains. In contrast, InternGPT incorporates pointing instructions (e.g., clicking and dragging) for better communication between chatbots and users, while also improving the accuracy of chatbots in performing vision-centric tasks. Nevertheless, this approach has several limitations, such as the expenses associated with API calls or the storage space required for the pre-trained weights of foundation models.

End-to-end trainable multimodal systems integrate vision and language models into a unified model, which is further trained on multimodal datasets. For instance, Flamingo leverages gated cross-attention to fuse pre-trained vision and language models, showing impressive ability in downstream multimodal tasks. BLIP-2 uses Q-Former to align the visual features with a large language model. Furthermore, Instruct-BLIP improves the training of Q-Former by introducing a novel instruction-aware visual feature extraction method. Based on this design, MiniGPT-4 uses Vicuna as the language model and fine-tunes on detailed image descriptions to better match user intent. Sparkles unlocks multimodal instruction-following models’ capabilities in open-ended dialogues involving multiple images. LLaVA injects visual features into the language model by treating image tokens as a foreign language, and uses conversations generated by GPT-4 for fine-tuning. Kosmos-1 is trained from scratch using web-scale corpora while showing impressive performance in zero-shot, few-shot, and multimodal chain-of-thought prompting settings. Analogously, Kosmos-2 incorporates grounding and referring abilities and can accept image regions that users select with bounding boxes as input. mPLUG-Owl efficiently fine-tunes the language model using low-rank adaptation with multimodal instruction datasets. Otter is built using Flamingo and aims to explore multimodal in-context learning capabilities.

2 Text Image Understanding

Text image understanding is a cutting-edge technology that harnesses the power of artificial intelligence, including natural language processing and computer vision, to automatically comprehend, categorize, and extract information from documents. Any file containing written or printed characters can be considered a document, including web pages, slides, posters, and even scene text images. Documents are ubiquitous in our daily lives, so the research on documents is significant.

Before the deep learning era, researchers used rule-based heuristic approaches for document analysis. They manually observed layout information and summarized heuristic rules, but these methods are not scalable and require enormous labour costs. Subsequently, the rise of deep learning has led to significant advancements in the field of Document AI. For example, the LayoutLM series employs large-scale document data for pre-training and incorporates text, layout, and image information into the model, showing impressive performance in downstream tasks like key information extraction and document question answering. Similarly, DocFormer introduces an additional task to reconstruct the document image during pre-training. Donut introduces an OCR-free document understanding Transformer, directly mapping an input document image to the desired output without OCR. MarkupLM takes advantage of large-scale webpages from Common Crawl and uses node-level hierarchical structure information as the pre-training objective. XDoc introduces a unified framework for tackling multiple document formats in one model for parameter efficiency. UDOP designs a unified model that integrates text, image, and layout modalities, showing impressive performance on diverse document understanding tasks. Pix2Struct is a pre-trained image-to-text model trained to parse masked screenshots of web pages into simplified HTML.

Despite significant progress in text image understanding, most models are designed for specific tasks and lack generalizability. In contrast, the proposed Kosmos-2.5 represents an important step forward in this field, demonstrating the potential of MLLMs in achieving robust and generalizable performance across a wide range of text image types.

Conclusion and Future Work

We introduced Kosmos-2.5, a multimodal literate model built on the strengths of Kosmos-2, designed to enhance machine understanding of text-intensive images. This model shifted from conventional encoder-only/encoder-decoder models to a more unified, decoder-only architecture. The shift to generative multimodal language modeling simplifies task interfaces, eliminating the need for complex, task-specific pipelines. Moreover, Kosmos-2.5 demonstrated potential in few-shot and zero-shot learning capabilities, laying a foundation for future advances and scalability in multimodal literate models.

Despite these promising results, our current model faces some limitations, offering valuable future research directions. For instance, Kosmos-2.5 currently does not support fine-grained control of document elements’ positions using natural language instructions, despite being pre-trained on inputs and outputs involving the spatial coordinates of text. Instruction tuning could offer a promising route to enhance this aspect of the model, leading to broader application capabilities. Furthermore, documents spanning multiple pages pose a challenge as they typically demand holistic processing and comprehension. Meanwhile, it is also feasible that Kosmos-2.5 allows for multiple image pages interleaved with text as input; however, managing long context windows remains a vital issue we aim to address in future work.

In the broader research landscape, a significant direction lies in furthering the development of model scaling capabilities. With an expanding spectrum of tasks and rising complexities, scaling up the model to handle larger volumes of data is crucial for the progression of multimodal literate models. Ultimately, our goal is to develop a model that effectively interprets both visual and textual data, and generalizes smoothly across an expanded array of text-intensive multimodal tasks.

Acknowledgement

We would like to acknowledge Zhiliang Peng for the helpful discussions.

Appendix A Supplementary Material

The hyperparameter settings are listed in Table 4.

A.2 Data Samples

We show some of the training samples used for Kosmos-2.5, which include the input and output from IIT-CDIP, arXiv papers, PowerPoint slides, general PDFs, web screenshots, README, DOCX, LaTeX and HTML.