Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei

cs.CL cs.CV

Introduction: From LLMs to MLLMs

Large language models (LLMs) have successfully served as a general-purpose interface across various natural language tasks. The LLM-based interface can be adapted to a task as long as we are able to transform the input and output into text. For example, the input of the summarization task is a document and the output is its summary, so we can feed the input document into the language model and have it generate the summary.

Despite these successful applications in natural language processing, it remains difficult to natively use LLMs for multimodal data, such as images and audio. Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world. More importantly, unlocking multimodal input greatly widens the applications of language models to more high-value areas, such as multimodal machine learning, document intelligence, and robotics.

In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, follow instructions (i.e., zero-shot learning), and learn in context (i.e., few-shot learning). The goal is to align perception with LLMs, so that the models are able to see and talk. To be specific, we follow MetaLM to train the Kosmos-1 model from scratch. As shown in Figure 1, a Transformer-based language model is regarded as the general-purpose interface, and perception modules are docked with the language model. We train the model on web-scale multimodal corpora, i.e., text data, arbitrarily interleaved images and texts, and image-caption pairs. In addition, we calibrate the instruction-following capability across modalities by transferring language-only data.

As shown in Table 1, the Kosmos-1 model natively supports language, perception-language, and vision tasks. We also present some generated examples in Figures 2 and 3. In addition to various natural language tasks, the Kosmos-1 models natively handle a wide range of perception-intensive tasks, spanning visual dialogue, visual explanation, visual question answering, image captioning, simple math equations, OCR, and zero-shot image classification with descriptions. We also build an IQ test benchmark following Raven’s Progressive Matrices, which evaluates the capability of nonverbal reasoning for MLLMs. The examples show that the native support of multimodal perception enables new opportunities to apply LLMs to new tasks. Moreover, we show that MLLMs achieve better commonsense reasoning performance compared with LLMs, which indicates cross-modal transfer helps knowledge acquisition.

Properly handling perception is a necessary step toward artificial general intelligence. The capability of perceiving multimodal input is critical to LLMs. First, multimodal perception enables LLMs to acquire commonsense knowledge beyond text descriptions. Second, aligning perception with LLMs opens the door to new tasks, such as robotics and document intelligence. Third, the capability of perception unifies various APIs, because graphical user interfaces are the most natural and unified means of interaction. For example, MLLMs can directly read the screen or extract numbers from receipts. We train the Kosmos-1 models on web-scale multimodal corpora, which ensures that the model robustly learns from diverse sources. We not only use a large-scale text corpus but also mine high-quality image-caption pairs and arbitrarily interleaved image and text documents from the web.

Following the philosophy proposed in MetaLM, we regard language models as a universal task layer. Because of the open-ended output space, we are able to unify various task predictions as texts. Moreover, natural-language instructions and action sequences (such as programming language) can be well handled by language models. LLMs also serve as basic reasoners, which is complementary to perception modules on complex tasks. So it is natural to align world, action, and multimodal perception with the general-purpose interface, i.e., language models.

As shown in Table 1, apart from the capabilities found in previous LLMs, MLLMs enable new usages and possibilities. First, we can conduct zero- and few-shot multimodal learning by using natural language instructions and demonstration examples. Second, we observe promising signals of nonverbal reasoning by evaluating the Raven IQ test, which measures the fluid reasoning ability of humans. Third, MLLMs naturally support multi-turn interactions for general modalities, such as multimodal dialogue.

Kosmos-1: A Multimodal Large Language Model

As shown in Figure 1, Kosmos-1 is a multimodal language model that can perceive general modalities, follow instructions, learn in context, and generate outputs. Given the previous context, the model learns to generate texts in an auto-regressive manner. Specifically, the backbone of Kosmos-1 is a Transformer-based causal language model. Apart from text, other modalities are embedded and fed into the language model. The Transformer decoder serves as a general-purpose interface to multimodal input. We train Kosmos-1 on multimodal corpora, including monomodal data, cross-modal paired data, and interleaved multimodal data. Once the models are trained, we can directly evaluate the models in zero-shot and few-shot settings on both language tasks and multimodal tasks.

The Transformer decoder perceives general modalities in a unified way. For the input format, we flatten the input into a sequence decorated with special tokens. Specifically, we use <s> and </s> to denote start- and end-of-sequence. The special tokens <image> and </image> indicate the beginning and end of encoded image embeddings. For example, “<s> document </s>” is a text input, and “<s> paragraph <image> Image Embedding </image> paragraph </s>” is an interleaved image-text input. Table 21 in the Appendix shows some examples of the input format.
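The flattening step can be illustrated with a minimal sketch. It assumes the boundary tokens are the strings `<s>`, `</s>`, `<image>`, and `</image>`; the `ImageSlot` type, the `flatten` helper, and the whitespace split (a stand-in for the real tokenizer) are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union

# Assumed special-token strings for sequence and image boundaries.
BOS, EOS = "<s>", "</s>"
BOI, EOI = "<image>", "</image>"


@dataclass
class ImageSlot:
    """Hypothetical placeholder for the embeddings of one encoded image."""
    image_id: str


Segment = Union[str, ImageSlot]


def flatten(segments: List[Segment]) -> List[Segment]:
    """Flatten interleaved text and images into one decorated sequence."""
    seq: List[Segment] = [BOS]
    for seg in segments:
        if isinstance(seg, ImageSlot):
            # Image embeddings are wrapped in begin/end-of-image tokens.
            seq += [BOI, seg, EOI]
        else:
            # Whitespace splitting stands in for a real subword tokenizer.
            seq += seg.split()
    seq.append(EOS)
    return seq


print(flatten(["paragraph", ImageSlot("img0"), "paragraph"]))
```

A text-only input is the special case with no `ImageSlot` in the list, so the same function covers both examples from the paragraph above.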

An embedding module is used to encode both text tokens and other input modalities into vectors, which are then fed into the decoder. For input tokens, we use a lookup table to map them into embeddings. For modalities with continuous signals (e.g., image and audio), it is also feasible to represent inputs as discrete codes and then regard them as “foreign languages”. Following prior work, we employ a vision encoder as the embedding module for input images. In addition, a Resampler is used as an attentive pooling mechanism to reduce the number of image embeddings.

2.2 Multimodal Large Language Models (MLLMs)

MLLMs serve as general-purpose interfaces that can interact with both natural language and multimodal input. The framework is flexible enough to handle various data types, as long as we can represent the input as vectors. MLLMs combine the best of two worlds. First, the language models naturally inherit the capabilities of in-context learning and instruction following. Second, perception is aligned with language models by training on multimodal corpora.

The implementation is based on the library TorchScale (https://github.com/microsoft/torchscale), which is designed for large-scale model training. Compared with the standard Transformer architecture, we include the following modifications:

We use Magneto, a Transformer variant, as the backbone architecture. Magneto has better training stability and superior performance across modalities. It introduces an extra LayerNorm to each sublayer (i.e., multi-head self-attention and feed-forward network). The method also has a theoretically derived initialization that improves optimization, which allows us to scale up the models effectively and stably.

We employ xPos relative position encoding for better long-context modeling. The method generalizes better to different lengths, i.e., training on short sequences while testing on longer ones. Moreover, xPos optimizes attention resolution so that the position information can be captured more precisely. xPos is efficient and effective in both interpolation and extrapolation settings.

2.3 Training Objective

Kosmos-1 is trained on web-scale multimodal corpora, including monomodal data (e.g., text corpus), cross-modal paired data (e.g., image-caption pairs), and interleaved multimodal data (e.g., documents of arbitrarily interleaved images and texts). To be specific, we use monomodal data for representation learning. For example, language modeling with text data pretrains instruction following, in-context learning, and various language tasks. Moreover, cross-modal pairs and interleaved data teach the model to align the perception of general modalities with language models. Interleaved data also naturally fit the multimodal language modeling task. We present more details of training data collection in Section 3.1.

The models are trained with the next-token prediction task, i.e., learning to generate the next token depending on the previous context. The training objective is to maximize the log-likelihood of tokens in examples. Notice that only discrete tokens, such as text tokens, are accounted for in the training loss. Multimodal language modeling is a scalable way to train the models. More importantly, the emergence of various capabilities makes the training task favorable for downstream applications.
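To make the objective concrete, the following is a minimal sketch of a next-token loss in which only discrete text tokens contribute. The function name, the nested-list representation of log-probabilities, and the boolean mask are illustrative assumptions, not the training code used for Kosmos-1.

```python
import math
from typing import List, Sequence


def masked_next_token_nll(
    log_probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    is_text: Sequence[bool],
) -> float:
    """Mean negative log-likelihood over text-token targets only.

    log_probs[t][v] is the log-probability that the token following
    position t has vocabulary id v; targets[t] is the true next-token id;
    is_text[t] is False when that target is not a discrete token
    (e.g. an image-embedding position) and must be left out of the loss.
    """
    total, count = 0.0, 0
    for lp, y, keep in zip(log_probs, targets, is_text):
        if keep:
            total -= lp[y]
            count += 1
    return total / count if count else 0.0


# Toy check: a uniform distribution over 4 tokens gives a loss of log(4),
# regardless of what is predicted at the masked (non-text) position.
uniform: List[float] = [math.log(0.25)] * 4
loss = masked_next_token_nll([uniform, uniform, uniform], [0, 1, 2], [True, False, True])
print(loss)
```

Minimizing this quantity is equivalent to maximizing the log-likelihood of the text tokens given the preceding multimodal context.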

Model Training

The models are trained on web-scale multimodal corpora. The training datasets consist of text corpora, image-caption pairs, and interleaved data of images and texts.

We train our model with The Pile and Common Crawl (CC). The Pile is a massive English text dataset built for training large-scale language models, which is produced from a variety of data sources. We exclude data splits from GitHub, arXiv, Stack Exchange, and PubMed Central. We also include the Common Crawl snapshots (2020-50 and 2021-04), CC-Stories, and RealNews datasets. All datasets have been purged of duplicate and near-duplicate documents and filtered to exclude downstream task data. Refer to Appendix B.1.1 for detailed descriptions of training text corpora.

The image-caption pairs are constructed from several datasets, including English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions. English LAION-2B, LAION-400M, and COYO-700M are collected from web pages of the Common Crawl web data by extracting image sources and the corresponding alt-text. Conceptual Captions are also from internet web pages. More details can be found in Appendix B.1.2.

We collect interleaved multimodal data from the Common Crawl snapshot, which is a publicly available archive of web pages. We use a filtering process to select about 71M web pages from the original 2B web pages in the snapshot. We then extract the text and images from the HTML of each selected web page. For each document, we limit the number of images to five to reduce noise and redundancy. We also randomly discard half of the documents that only have one image to increase the diversity. We provide more details about the data collection process in Appendix B.1.3. By using this corpus, we enable Kosmos-1 to handle interleaved text and image and improve its few-shot ability.
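The two per-document rules above (at most five images, and dropping half of the single-image documents) can be sketched as follows. The dictionary layout, the `filter_document` name, and the choice to keep the first five images are assumptions made for illustration; the source does not state how the five images are chosen.

```python
import random
from typing import Any, Dict, Optional

MAX_IMAGES = 5  # per-document image cap stated in the text


def filter_document(doc: Dict[str, Any], rng: random.Random) -> Optional[Dict[str, Any]]:
    """Apply the image cap and single-image subsampling to one document.

    `doc` is assumed to look like {"text": ..., "images": [...]}.
    Returns the (possibly trimmed) document, or None if it is discarded.
    """
    images = doc["images"][:MAX_IMAGES]  # assumption: keep the first five
    if len(images) == 1 and rng.random() < 0.5:
        return None  # randomly discard half of the single-image documents
    return {**doc, "images": images}


rng = random.Random(0)
corpus = [{"text": "page", "images": [f"img{i}" for i in range(n)]} for n in (1, 3, 8)]
kept = [d for d in (filter_document(doc, rng) for doc in corpus) if d is not None]
print([len(d["images"]) for d in kept])
```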

3.2 Training Setup

The MLLM component has 24 layers with 2,048 hidden dimensions, 8,192 FFN intermediate size, and 32 attention heads, resulting in about 1.3B parameters. We use Magneto’s initialization for optimization stability. For faster convergence, the image representation is obtained from a pretrained CLIP ViT-L/14 model with 1,024 feature dimensions. The images are preprocessed into 224 $\times$ 224 resolution during training. We freeze the parameters of the CLIP model except for the last layer during training. The total number of parameters of Kosmos-1 is about 1.6B. More details about hyperparameters can be found in Appendix A.
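As a rough consistency check on the reported size, the weight matrices of the decoder layers can be counted directly from the stated dimensions. This is a back-of-the-envelope sketch that assumes a standard layer layout (four attention projections and a two-matrix feed-forward block) and ignores biases, LayerNorm parameters, and embeddings; `decoder_params` is a hypothetical helper.

```python
def decoder_params(layers: int, d_model: int, d_ffn: int) -> int:
    """Count attention and feed-forward weight-matrix parameters."""
    attn = 4 * d_model * d_model  # query, key, value and output projections
    ffn = 2 * d_model * d_ffn     # up-projection and down-projection
    return layers * (attn + ffn)


n = decoder_params(layers=24, d_model=2048, d_ffn=8192)
print(f"{n:,} parameters (~{n / 1e9:.2f}B)")  # 1,207,959,552 parameters (~1.21B)
```

The remaining gap to the reported figure of about 1.3B would plausibly be covered by the token embeddings and normalization parameters, whose size depends on the vocabulary, which is not stated in this section.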

We use a batch size of 1.2 million tokens (0.5 million tokens from text corpora, 0.5 million tokens from image-caption pairs, and 0.2 million tokens from interleaved data) and train Kosmos-1 for 300k steps, corresponding to about 360 billion tokens. We adopt the AdamW optimizer with $\beta=(0.9,0.98)$. We set the weight decay to 0.01 and the dropout rate to 0.1. The learning rate increases to 2e-4 for the first 375 warming-up steps and decays linearly to 0 for the rest of the training steps. We use SentencePiece to tokenize the text. We preprocess the data in the “full-sentence” format, which packs each input sequence with full sentences that are sampled continuously from one or more documents.
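The schedule and token budget described above can be written out as a short sketch. It assumes the warm-up is linear from zero, which the text does not state explicitly; `learning_rate` is a hypothetical helper name.

```python
def learning_rate(step: int, peak: float = 2e-4, warmup: int = 375, total: int = 300_000) -> float:
    """Warm up to `peak` (assumed linear), then decay linearly to 0."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


# Token budget: batch size in tokens times the number of steps.
tokens_per_batch = 500_000 + 500_000 + 200_000  # text + image-caption + interleaved
total_tokens = tokens_per_batch * 300_000

print(learning_rate(375), learning_rate(300_000))  # 0.0002 0.0
print(f"{total_tokens:,}")  # 360,000,000,000
```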

3.3 Language-Only Instruction Tuning

In order to better align Kosmos-1 with human instructions, we perform language-only instruction tuning. Specifically, we continue training the model with instruction data in the format of (instructions, inputs, outputs). The instruction data is language-only and is mixed with the training corpora. The tuning process is conducted as language modeling. Notice that instructions and inputs are not accounted for in the loss. Section 4.9.1 shows that the improvements in the instruction-following capability can transfer across modalities.

We combine Unnatural Instructions and FLANv2 as our instruction dataset. Unnatural Instructions is a dataset that was created by using a large language model to generate instructions for various natural language processing tasks. It has 68,478 instruction-input-output triplets in its core dataset. FLANv2 is a collection of datasets that cover diverse types of language understanding tasks, such as reading comprehension, commonsense reasoning, and closed-book question answering. We randomly select 54k examples of instructions from FLANv2 to augment our instruction dataset. Details of the training hyperparameter settings are described in Appendix A.2.

Evaluation

MLLMs can handle both language tasks and perception-intensive tasks. We evaluate Kosmos-1 on various types of tasks as follows:

Zero-shot image classification with descriptions

We evaluate the perception-language capability of Kosmos-1 under vision-language settings. Specifically, we conduct zero-shot and few-shot experiments on two widely used tasks, including image captioning and visual question answering. Image captioning involves generating a natural language description of an image, while visual question answering aims to answer a natural language question with respect to an image.

We evaluate caption generation on MS COCO Caption and Flickr30k. We use the test set of the COCO Karpathy split, which re-partitions the train2014 and val2014 images into 113,287, 5,000, and 5,000 for the training set, validation set, and test set, respectively. We also evaluate on the test set of Flickr30k’s Karpathy split. The image resolution is 224 $\times$ 224. We use beam search with a beam size of 5 to generate the captions. In the few-shot settings, we randomly sample demonstrations from the training set. We use COCOEvalCap (https://github.com/salaniz/pycocoevalcap) to compute CIDEr and SPICE scores as the evaluation metrics. We prompt Kosmos-1 with “An image of” for zero-shot and few-shot caption generation experiments.

For visual question-answering tasks, we evaluate zero-shot and few-shot results on the test-dev sets of VQAv2 and VizWiz. The resolution of images is 224 $\times$ 224. We use greedy search for decoding. We follow the normalization rules of the VQAv2 evaluation code (https://github.com/GT-Vision-Lab/VQA) when computing the VQA accuracy. We evaluate VQA in an open-ended setting in which Kosmos-1 generates answers and stops at the </s> (“end of sequence”) token. The prompt is “Question: {question} Answer: {answer}” for visual question answering tasks.
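A minimal sketch of how the text part of a zero- or few-shot VQA prompt could be assembled from this template is shown below. The `build_vqa_prompt` helper and the newline separator between demonstrations are assumptions; in the actual input each question would also be preceded by its image embeddings, which are omitted here.

```python
from typing import Iterable, Tuple

TEMPLATE = "Question: {question} Answer:"


def build_vqa_prompt(question: str, demos: Iterable[Tuple[str, str]] = ()) -> str:
    """Build the textual prompt; demos are (question, answer) pairs."""
    parts = [f"{TEMPLATE.format(question=q)} {a}" for q, a in demos]
    # The final answer slot is left empty for the model to complete.
    parts.append(TEMPLATE.format(question=question))
    return "\n".join(parts)  # assumed separator between examples


print(build_vqa_prompt("What color is the bus?"))
print(build_vqa_prompt("What color is the bus?", demos=[("How many dogs are there?", "two")]))
```

With an empty `demos` argument the function yields the zero-shot prompt, and with k demonstration pairs it yields the k-shot prompt.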

4.1.2 Results

Table 2 shows the zero-shot captioning performance on the COCO Karpathy test split and the Flickr30k test set. Kosmos-1 achieves strong results in the zero-shot setting on both image captioning datasets. Specifically, our model achieves a CIDEr score of 67.1 on the Flickr30k dataset, compared to 60.6 and 61.5 for the Flamingo-3B and Flamingo-9B models, respectively. Notably, our model accomplishes this with a smaller size of 1.6B parameters, compared to the Flamingo models. This demonstrates our model’s advantage in zero-shot image captioning.

Table 3 reports the results of the few-shot ( $k=2,4,8$ ) settings. The overall performance improves as the number of shots increases from two to four. The trends are consistent across the two datasets. Moreover, the few-shot results outperform zero-shot captioning in Table 2.

Table 4 reports the zero-shot visual question answering results on VQAv2 and VizWiz. We show that Kosmos-1 can better handle the diversity and complexity of the VizWiz dataset. Kosmos-1 achieves higher accuracy and robustness than Flamingo-3B and Flamingo-9B models. In addition, our model is competitive with Flamingo on the VQAv2 dataset.

Table 5 shows the few-shot performance on visual question answering tasks. Kosmos-1 outperforms other models in few-shot ( $k=2,4$ ) settings on the VizWiz dataset. We also observe a positive correlation between the number of shots and the quality of the results on the VizWiz dataset. Moreover, the few-shot results are better than the zero-shot numbers as reported in Table 4.

4.2 IQ Test: Nonverbal Reasoning

Raven’s Progressive Matrices is one of the most common tests to evaluate nonverbal reasoning. The capability of nonverbal reasoning is typically a reflection of an individual’s intelligence quotient (IQ). Figure 4 shows an example. Given eight images presented in a $3\times 3$ matrix, the task is to identify the following element from six similar candidates.

The models need to conduct zero-shot nonverbal reasoning without explicitly fine-tuning. The Raven IQ test is analogous to in-context learning of language models, where the difference is whether the context is nonverbal or verbal. In order to infer the answers, the models have to recognize abstract concepts and identify the underlying patterns of given images. So the IQ task is a good testbed to benchmark the nonverbal in-context learning capability.

To evaluate Kosmos-1 on zero-shot nonverbal reasoning, we construct a dataset of the Raven IQ test. It consists of $50$ examples collected from different websites (https://en.testometrika.com/intellectual/iq-test/, https://en.testometrika.com/intellectual/iq-test-for-kids-7-to-16-year-old/, https://iqpro.org/, and https://iqhaven.com/matrix-g). Each example has three (i.e., $2\times 2$ matrix), four, or eight (i.e., $3\times 3$ matrix) given images. The goal is to predict the next one. Each instance has six candidate images with a unique correct completion. We measure accuracy scores to evaluate the models. The evaluation dataset is available at https://aka.ms/kosmos-iq50.

Figure 4 illustrates how to evaluate Kosmos-1 on the Raven IQ test. The matrix-style images are flattened and fed into the models one-by-one. To enable the model to better understand the desired task, we also use a textual instruction “Here are three/four/eight images:”, “The following image is:”, and “Is it correct?” for conditioning. We append each possible candidate to the context separately and compare the probability that the model outputs “Yes” in a close-ended setting. The candidate that yields the largest probability is regarded as the prediction.
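The selection rule described above can be sketched as follows. The `p_yes` callable is a hypothetical stand-in for the model: it is assumed to take the flattened prompt (text segments and images) and return the probability of the answer “Yes”. The function name and the list-based prompt layout are also illustrative.

```python
from typing import Any, Callable, List, Sequence

COUNT_WORDS = {3: "three", 4: "four", 8: "eight"}


def pick_candidate(
    context_images: Sequence[Any],
    candidates: Sequence[Any],
    p_yes: Callable[[List[Any]], float],
) -> int:
    """Return the index of the candidate with the highest P("Yes")."""
    word = COUNT_WORDS[len(context_images)]
    scores = []
    for cand in candidates:
        # Each candidate is appended to the same context separately.
        prompt = [
            f"Here are {word} images:",
            *context_images,
            "The following image is:",
            cand,
            "Is it correct?",
        ]
        scores.append(p_yes(prompt))
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy stand-in scorer that happens to prefer candidate "c3".
toy_scorer = lambda prompt: 0.9 if prompt[-2] == "c3" else 0.1
print(pick_candidate([f"m{i}" for i in range(8)], [f"c{i}" for i in range(6)], toy_scorer))  # 3
```

With six candidates per instance, a uniformly random choice would be correct one time in six, which is the baseline the accuracy scores are compared against.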

4.2.2 Results

Table 6 shows the evaluation results on the IQ test dataset. Kosmos-1 with and without language-only instruction tuning improves over the random baseline by 5.3% and 9.3%, respectively. The results indicate that Kosmos-1 is able to perceive abstract conceptual patterns in a nonverbal context, and then deduce the following element across multiple choices. To the best of our knowledge, it is the first time that a model can perform such zero-shot Raven IQ tests. Although there is still a large performance gap between the current model and the average level of adults, Kosmos-1 demonstrates the potential of MLLMs to perform zero-shot nonverbal reasoning by aligning perception with language models.

4.3 OCR-Free Language Understanding

OCR-free language understanding is a task that focuses on understanding text and images without relying on Optical Character Recognition (OCR). For example, in the Rendered SST-2 task, sentences from the Stanford Sentiment Treebank dataset are rendered as images. The model is asked to predict the sentiment of the text within the images. The task evaluates a model’s ability to read and comprehend the meaning of words and sentences directly from the images.

We evaluate OCR-free language understanding on the Rendered SST-2 test set and HatefulMemes validation set. We use accuracy as the metric for the Rendered SST-2 and report ROC AUC for the HatefulMemes dataset. We use the prompt “Question: what is the sentiment of the opinion? Answer: {answer}”, where the answer is either positive or negative for the Rendered SST-2. For the HatefulMemes task, the prompt is “Question: does this picture contain real hate speech? Answer: {answer}”, where the answer is either yes or no.

4.3.2 Results

As shown in Table 7, Kosmos-1 achieves a ROC AUC of 63.9% for the HatefulMemes validation set and a test accuracy of 67.1% for the Rendered SST-2 test set. It outperforms CLIP ViT-L and Flamingo-9B, which achieve AUCs of 63.3% and 57.0% on the HatefulMemes task. Note that Flamingo explicitly includes OCR text in the prompt, while Kosmos-1 does not access any external tools or resources. This indicates that Kosmos-1 has built-in abilities to read and comprehend the text in the rendered images.

4.4 Web Page Question Answering

Web page question answering aims at finding answers to questions from web pages. It requires the model to comprehend both the semantics and the structure of texts. The structure of the web page (such as tables, lists, and HTML layout) plays a key role in how the information is arranged and displayed. The task can help us evaluate our model’s ability to understand the semantics and the structure of web pages.

We compare the performance on the Web-based Structural Reading Comprehension (WebSRC) dataset. For comparison, we train a language model (LLM) on the same text corpora with the same training setup as Kosmos-1. The LLM takes the text extracted from the web page as input. Its prompt template is “Given the context below from web page, extract the answer from the given text like this: Question: Who is the publisher of this book? Answer: Penguin Books Ltd. Context: {WebText} Q: {question} A: {answer} ”, where {WebText} represents the text extracted from the web page. Besides using the same prompt, Kosmos-1 prepends the image before the prompt. Two example images from WebSRC are shown in Appendix C.3. Following the original paper, we use exact match (EM) and F1 scores as our evaluation metrics.
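For readers unfamiliar with the two metrics, the following is a minimal sketch of exact match and token-level F1 as they are commonly computed for extractive question answering. The normalization here (lowercasing and whitespace tokenization) is a simplifying assumption and may differ from the official WebSRC evaluation script.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized answer."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("penguin books ltd.", "Penguin Books Ltd."))  # 1.0
print(f1_score("Penguin Books", "Penguin Books Ltd."))
```

EM rewards only fully correct answers, while F1 gives partial credit when the predicted span overlaps the reference, which is why the two are usually reported together.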

4.4.2 Results

The experimental results are summarized in Table 8. We observe that Kosmos-1 outperforms the LLM, indicating that Kosmos-1 can benefit from the layout and style information of web pages in images. In addition, we evaluate the performance of Kosmos-1 without the extracted text in the prompt. The extracted text contributes +12.0 EM and +20.7 F1 to Kosmos-1, indicating that the benefit from modeling images does not sacrifice its language abilities.

4.5 Multimodal Chain-of-Thought Prompting

Chain-of-thought prompting allows large language models to generate a series of reasoning steps and decompose a multi-step problem into intermediate steps, which can significantly improve the performance in complex tasks. Motivated by chain-of-thought prompting, we investigate a multimodal chain-of-thought prompting using Kosmos-1. As illustrated in Figure 5, we break down perception-language tasks into two steps. In the first stage, given an image, we use a prompt to guide the model to generate a rationale. The model is then fed the rationale and a task-aware prompt to produce the final results.

We evaluate the ability of multimodal chain-of-thought prompting on the Rendered SST-2. We use the prompt “Introduce this picture in detail:” to generate the content in the picture as the rationale. Then, we use the prompt “{rationale} Question: what is the sentiment of the opinion? Answer: {answer}” to predict the sentiment, where the answer is either positive or negative.
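The two-stage procedure can be sketched as below. The `generate` callable is a hypothetical stand-in for the model that maps a list of input segments to generated text, and feeding the image again in the second stage is an assumption of this sketch rather than something stated in the text.

```python
from typing import Any, Callable, List, Tuple

RATIONALE_PROMPT = "Introduce this picture in detail:"
TASK_PROMPT = "{rationale} Question: what is the sentiment of the opinion? Answer:"


def multimodal_cot(image: Any, generate: Callable[[List[Any]], str]) -> Tuple[str, str]:
    """Stage 1 produces a rationale; stage 2 answers conditioned on it."""
    rationale = generate([image, RATIONALE_PROMPT])
    # Assumption: the image is provided again alongside the rationale.
    answer = generate([image, TASK_PROMPT.format(rationale=rationale)])
    return rationale, answer


# Toy stand-in model that returns canned strings for the two stages.
def toy_generate(segments: List[Any]) -> str:
    return "The picture shows the text 'a charming film'." if segments[1] == RATIONALE_PROMPT else "positive"


print(multimodal_cot("rendered_sst2_image", toy_generate))
```

Standard prompting corresponds to skipping the first call and using the task prompt with an empty rationale.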

4.5.2 Results

We conduct experiments to evaluate the performance of the multimodal chain-of-thought prompting. Table 9 shows that multimodal chain-of-thought prompting achieves a score of 72.9, which is 5.8 points higher than the standard prompting. By generating intermediate content, the model can recognize the text in the images and infer the sentiment of the sentences more correctly.

4.6 Zero-Shot Image Classification

We report the zero-shot image classification performance on ImageNet. Image classification comprehends an entire image as a whole and aims to assign a label to the image. We map each label to its category name in natural language. The model is prompted to predict the category name to perform zero-shot image classification.

Given an input image, we concatenate the image with the prompt “The photo of the”. The input is then fed into the model to obtain the category name of the image. We evaluate the model on ImageNet, which contains 1.28M training images and 50k validation images in 1k object categories. The prediction is classified as correct if it is exactly the same as the ground-truth category name. The image resolution used for evaluation is 224 $\times$ 224. We use beam search with a beam size of 2 to generate the category names.

4.6.2 Results

As shown in Table 10, we report zero-shot results in both constrained and unconstrained settings. The difference between the two settings is whether we use the 1k object category names to constrain the decoding. Kosmos-1 significantly outperforms GIT by 4.6% under the constrained setting and 2.1% under the unconstrained setting.
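One common way to constrain decoding to a fixed label set is to allow, at each step, only tokens that keep the output a prefix of some category name. The sketch below illustrates that idea on pre-tokenized names; it is an assumption about how such a constraint could be realized, not a description of the implementation used in the experiments, and `allowed_next` is a hypothetical helper.

```python
from typing import Sequence, Set, Tuple


def allowed_next(prefix: Tuple[str, ...], names: Sequence[Tuple[str, ...]]) -> Set[str]:
    """Tokens that extend `prefix` toward at least one category name."""
    k = len(prefix)
    return {name[k] for name in names if len(name) > k and name[:k] == prefix}


# Toy label set; each name is a tuple of tokens.
label_names = [("tabby", "cat"), ("tiger", "cat"), ("tiger", "shark")]

print(sorted(allowed_next((), label_names)))          # ['tabby', 'tiger']
print(sorted(allowed_next(("tiger",), label_names)))  # ['cat', 'shark']
```

In the unconstrained setting no such filter is applied, so the model may generate any string and is counted as correct only when the output matches the ground-truth name exactly.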

4.7 Zero-Shot Image Classification with Descriptions

The standard approach of image classification as above is to prompt the model for the specific name of the object depicted in the image. However, there are also some classification rules customized for different users and scenarios, such as the refined classification of complex animal subspecies. We can utilize natural language descriptions to guide Kosmos-1 to distinguish images in the zero-shot setting, which makes the decision process more interpretable.

Following CUB, we construct a bird classification dataset that contains images and natural-language descriptions of categories. The dataset has three groups of binary image classification. Each group contains two animal categories with similar appearances. Our goal is to classify images given the categories’ descriptions. Table 11 presents the data samples. The first group is from prior work, while the other two groups are collected from the website. Each category contains twenty images.

The evaluation procedure is illustrated in Figure 6. For the zero-shot setting, we provide detailed descriptions of two specific categories and use the template “Question:what is the name of {general category} in the picture? Answer:” to prompt the model for the specific category name in an open-ended manner. To evaluate the effect of providing verbal descriptions in context, we also implement a zero-shot baseline without prompting descriptions. Instead, we provide the corresponding specific names in the prompt.

4.7.2 Results

The evaluation results are shown in Table 12. We observe that providing descriptions in context can significantly improve the accuracy of image classification. The consistent improvements indicate that Kosmos-1 can perceive the intentions of instructions and well align the concepts in language modality with visual features in vision modality.

4.8 Language Tasks

The models are evaluated on the language tasks given task instructions (i.e., zero-shot) or several demonstration examples (i.e., few-shot). Text inputs are directly fed into the models as in vanilla language models.

We train a language model (LLM) baseline with the same text corpora and training setup. We evaluate Kosmos-1 and the LLM baseline on eight language tasks, including cloze and completion tasks (i.e., StoryCloze, HellaSwag), Winograd-style tasks (i.e., Winograd, Winogrande), commonsense reasoning (i.e., PIQA), and three datasets (BoolQ, CB, and COPA) from the SuperGLUE benchmark. The detailed descriptions of these datasets are provided in Appendix C.2. We conduct experiments under zero-shot and few-shot settings. We evaluate each test example by randomly sampling examples from the training set as demonstrations. We set the number of shots to 0, 1, and 4 in our experiments.

4.8.2 Results

Table 13 presents the in-context learning performance of language tasks. Kosmos-1 achieves comparable or even better performance in cloze completion and commonsense reasoning tasks when compared to LLM. In terms of the average result across all these datasets, LLM performs better in zero-shot and one-shot settings, whereas our model performs better in few-shot ( $k=4$ ) settings. The results indicate that Kosmos-1 also handles language-only tasks well and achieves favorable performance across datasets. In addition, Section 4.9.2 shows that MLLMs learn better visual commonsense knowledge compared with LLMs.

4.9 Cross-modal Transfer

Cross-modal transferability allows a model to learn from one modality (such as text, image, audio, etc.) and transfer the knowledge to the other modalities. This skill can enable a model to perform various tasks across different modalities. In this part, we evaluate the cross-modal transferability of Kosmos-1 on several benchmarks.

To evaluate the effect of language-only instruction tuning, we conduct an ablation study using four datasets: COCO, Flickr30k, VQAv2, and VizWiz. These datasets cover image captioning and visual question answering. The evaluation metrics are CIDEr scores for COCO/Flickr30k and VQA accuracy for VQAv2/VizWiz.

Table 14 shows the experimental results. Language-only instruction tuning boosts our model’s performance by 1.9 points on Flickr30k, 4.3 points on VQAv2, and 1.3 points on VizWiz. Our experiments show that language-only instruction tuning can significantly improve the model’s instruction-following capabilities across modalities. The results also indicate that our model can transfer the instruction-following capability from language to other modalities.

4.9.2 Transfer from Multimodal to Language: Visual Commonsense Reasoning

Visual commonsense reasoning tasks require an understanding of the properties of everyday objects in the real world, such as color, size, and shape. These tasks are challenging for language models because they may require more information about object properties than what is available in texts. To investigate the visual commonsense capabilities, we compare the zero-shot performance of Kosmos-1 and LLM on visual commonsense reasoning tasks.

We compare Kosmos-1 and the LLM baseline on three object commonsense reasoning datasets: RelativeSize, MemoryColor, and ColorTerms. Table 15 shows some examples of object size and color reasoning tasks. RelativeSize contains 486 object pairs from 41 physical objects. The model is required to predict the size relation between two objects in a binary question-answering format with “Yes”/“No” answers. MemoryColor and ColorTerms require the model to predict the color of objects from a set of 11 color labels in a multiple-choice format. We use only text as our input and do not include any images. We measure the accuracy of our model on these three datasets.

Table 16 presents the zero-shot performance of Kosmos-1 and the LLM on visual commonsense reasoning tasks. Kosmos-1 significantly outperforms the LLM by 1.5% on RelativeSize, 14.7% on MemoryColor, and 9.7% on ColorTerms. The consistent improvements indicate that Kosmos-1 benefits from visual knowledge when completing the corresponding visual commonsense reasoning. We attribute Kosmos-1’s superior performance to modality transferability, which enables the model to transfer visual knowledge to language tasks. In contrast, the LLM has to rely on textual knowledge and clues to answer visual commonsense questions, which limits its ability to reason about object properties.

Conclusion

In this work, we introduce Kosmos-1, a multimodal large language model that can perceive general modalities, follow instructions, and perform in-context learning. The models trained on web-scale multimodal corpora achieve promising results across a wide range of language tasks and multimodal tasks. We show that going from LLMs to MLLMs enables new capabilities and opportunities. In the future, we would like to scale up Kosmos-1 in terms of model size and integrate the speech capability into Kosmos-1. In addition, Kosmos-1 can be used as a unified interface for multimodal learning, e.g., using instructions and examples to control text-to-image generation.

References

Appendix A Hyperparameters

We report the detailed model hyperparameter settings of Kosmos-1 in Table 17 and training hyperparameters in Table 18.

A.2 Language-Only Instruction Tuning

The detailed instruction tuning hyperparameters are listed in Table 19.

Appendix B Datasets

Kosmos-1 is trained on The Pile and Common Crawl. The Pile is an 800 GB English text corpus combining 22 diverse sources, from which we select a subset of seven sources. Common Crawl takes snapshots of the web, which contain massive amounts of language data. Table 20 provides a full overview of the language datasets that were used in the training of the Kosmos-1 model. These data sources can be divided into the following three categories:

Internet: Pile-CC, OpenWebText2, Wikipedia (English), CC-2020-50, CC-2021-04, Realnews

Prose: BookCorpus2, Books3, Gutenberg, CC-Stories

B.1.2 Image-Caption Pairs

Kosmos-1 is trained on image-caption pairs constructed from several datasets, including English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions. The LAION-2B, LAION-400M, and COYO-700M datasets are extracted by parsing out image URLs and alt-texts of web pages from the Common Crawl web data. LAION-2B contains about 2B English image-caption pairs, LAION-400M consists of 400M English image-caption pairs, and COYO-700M has 700M English image-caption pairs. Conceptual Captions contains 15M English image-caption pairs and consists of two datasets, CC3M and CC12M, which are also collected from internet webpages using a Flume pipeline. For Conceptual Captions, we discard pairs whose captions contain special tags such as “”.

B.1.3 Interleaved Data

We collect a large corpus of 2 billion web pages from Common Crawl snapshots. To ensure quality and relevance, we apply several filtering criteria. First, we discard any pages that are not written in English. Second, we discard any pages that do not have images interspersed in the text. Third, we discard any images that have a resolution lower than 64 by 64 pixels or that are single-colored. Fourth, we discard any text that is not meaningful or coherent, such as spam or gibberish. We use some heuristics to identify and remove gibberish text containing emoji symbols, hashtags, and URL links. After applying these filters, we end up with about 71 million documents for training.
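The four filtering criteria above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the pipeline used to build the corpus: the `Page` and `Image` types, the precomputed `language` and `distinct_colors` fields, and the `NOISE` regular expression are hypothetical. The sketch also assumes that an image is dropped when either side is below 64 pixels, and that a page is discarded if no image or no text survives filtering; the text does not specify either detail.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

MIN_SIDE = 64  # from the text: drop images below 64 by 64 pixels

# Hypothetical heuristic: flag URLs, hashtags, and a rough emoji code range.
NOISE = re.compile(r"https?://\S+|#\w+|[\U0001F300-\U0001FAFF]")


@dataclass
class Image:
    width: int
    height: int
    distinct_colors: int  # assumed to be precomputed from the pixel data


@dataclass
class Page:
    language: str  # assumed output of some language identifier, e.g. "en"
    texts: List[str] = field(default_factory=list)
    images: List[Image] = field(default_factory=list)


def keep_image(img: Image) -> bool:
    """Criterion 3: reject low-resolution or single-colored images."""
    big_enough = img.width >= MIN_SIDE and img.height >= MIN_SIDE
    return big_enough and img.distinct_colors > 1


def filter_page(page: Page) -> Optional[Page]:
    """Apply the four criteria; return the cleaned page or None if discarded."""
    if page.language != "en":                       # criterion 1
        return None
    images = [i for i in page.images if keep_image(i)]
    if not images:                                  # criterion 2 (after 3)
        return None
    texts = [t for t in page.texts                  # criterion 4
             if t.strip() and not NOISE.search(t)]
    if not texts:
        return None
    return Page("en", texts, images)
```

For brevity, the sketch stores text segments and images in separate lists and therefore does not preserve their interleaved order; a real pipeline for interleaved data would need to keep a single ordered sequence of segments.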

B.2 Data Format

The training data is organized in the format as follows:

Appendix C Evaluation

Figure 7 shows how we conduct zero-shot and few-shot evaluations on perception-language tasks.

C.2 Language Tasks

We conduct experiments on language tasks in four categories:

Cloze and completion tasks: StoryCloze, HellaSwag

Winograd-style tasks: Winograd, Winogrande

Three datasets from the SuperGLUE benchmark: BoolQ, CB, COPA