Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei

Introduction: The Big Convergence

Recent years have featured a trend toward the big convergence of language, vision, and multimodal pretraining. By performing large-scale pretraining on massive data, we can easily transfer the models to various downstream tasks. It is appealing that we can pretrain a general-purpose foundation model that handles multiple modalities. In this work, we advance the convergence trend for vision-language pretraining from the following three aspects.

First, the success of Transformers has carried over from language to vision and multimodal problems. The unification of network architectures enables us to seamlessly handle multiple modalities. For vision-language modeling, there are various ways to apply Transformers due to the different natures of downstream tasks. For example, the dual-encoder architecture is used for efficient retrieval, encoder-decoder networks for generation tasks, and the fusion-encoder architecture for image-text encoding. However, most foundation models have to manually convert the end-task formats according to the specific architectures. Moreover, the parameters are usually not effectively shared across modalities. In this work, we adopt Multiway Transformers for general-purpose modeling, i.e., one unified architecture shared across various downstream tasks. The modular network also covers both modality-specific encoding and cross-modality fusion.

Second, the pretraining task based on masked data modeling has been successfully applied to various modalities, such as texts, images, and image-text pairs. Current vision-language foundation models usually multitask other pretraining objectives (such as image-text matching), which makes scaling up unfriendly and inefficient. In contrast, we only use one pretraining task, i.e., mask-then-predict, to train a general-purpose multimodal foundation model. By regarding the image as a foreign language (i.e., Imglish), we handle texts and images in the same manner without fundamental modeling differences. Consequently, image-text pairs are utilized as “parallel sentences” in order to learn the alignments between modalities. We also show that the simple yet effective method learns strong transferable representations, achieving state-of-the-art performance on both vision and vision-language tasks. The prominent success demonstrates the superiority of generative pretraining.

Third, scaling up the model size and data size universally improves the generalization quality of foundation models, so that we can transfer them to various downstream tasks. We follow this philosophy and scale up the model size to billions of parameters. Moreover, we scale up the pretraining data size in our experiments while only using publicly accessible resources for academic reproducibility. Even without using any private data, our method outperforms state-of-the-art foundation models that rely on in-house data by a decent margin. In addition, scaling up benefits from treating images as a foreign language, as we can directly reuse the pipeline developed for large-scale language model pretraining.

In this work, we take advantage of the above ideas to pretrain a general-purpose multimodal foundation model BEiT-3. We pretrain a Multiway Transformer by performing masked data modeling on images, texts, and image-text pairs. During pretraining, we randomly mask some proportion of text tokens or image patches. The self-supervised learning objective is to recover the original tokens (i.e., text tokens, or visual tokens) given corrupted inputs. The model is general-purpose in the sense that it can be repurposed for various tasks regardless of input modalities, or output formats.

As shown in Figure 1 and Table 1, BEiT-3 achieves state-of-the-art transfer performance across a broad range of vision and vision-language tasks. We evaluate BEiT-3 on extensive downstream tasks and datasets, i.e., object detection (COCO), instance segmentation (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO). Specifically, our model outperforms previous strong foundation models even though we only use public resources for pretraining and finetuning. The model also obtains better results than specialized models. Moreover, BEiT-3 performs well not only on vision-language tasks but also on vision tasks (such as object detection and semantic segmentation).

BEiT-3: A General-Purpose Multimodal Foundation Model

As shown in Figure 2, BEiT-3 is pretrained by masked data modeling on monomodal and multimodal data, using a shared Multiway Transformer network. The model can be transferred to various vision and vision-language downstream tasks.

We use Multiway Transformers as the backbone model to encode different modalities. As shown in Figure 2, each Multiway Transformer block consists of a shared self-attention module, and a pool of feed-forward networks (i.e., modality experts) used for different modalities. We route each input token to the experts depending on its modality. In our implementation, each layer contains a vision expert and a language expert. Moreover, the top three layers have vision-language experts designed for fusion encoders. Refer to Figure 3 (a)(b)(c) for more detailed modeling layouts. Using a pool of modality experts encourages the model to capture more modality-specific information. The shared self-attention module learns the alignment between different modalities and enables deep fusion for multimodal (such as vision-language) tasks.
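The routing described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class name `MultiwayBlock`, the helper functions, the single attention head, the ReLU activation, and the omission of layer normalization and output projection are all simplifying assumptions. The sketch only shows the two structural ideas from the text, namely one self-attention module shared by every token and one feed-forward expert selected per token by its modality.

```python
import math
import random


def linear(dim_in, dim_out, rng):
    """Random weight matrix stored as dim_out rows of length dim_in."""
    return [[rng.uniform(-0.02, 0.02) for _ in range(dim_in)] for _ in range(dim_out)]


def matvec(weight, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultiwayBlock:
    """Toy block: shared self-attention plus one FFN expert per modality."""

    def __init__(self, dim, hidden, modalities, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # A single set of attention projections is shared by all modalities.
        self.wq, self.wk, self.wv = (linear(dim, dim, rng) for _ in range(3))
        # One feed-forward expert (two linear maps) per modality name.
        self.experts = {
            m: (linear(dim, hidden, rng), linear(hidden, dim, rng)) for m in modalities
        }

    def attention(self, tokens):
        qs = [matvec(self.wq, t) for t in tokens]
        ks = [matvec(self.wk, t) for t in tokens]
        vs = [matvec(self.wv, t) for t in tokens]
        out = []
        for q in qs:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(self.dim) for k in ks]
            weights = softmax(scores)
            out.append([sum(w * v[i] for w, v in zip(weights, vs)) for i in range(self.dim)])
        return out

    def forward(self, tokens, modalities):
        # Every token attends to every other token, whatever its modality.
        attended = self.attention(tokens)
        states = [[a + b for a, b in zip(t, s)] for t, s in zip(tokens, attended)]
        outputs = []
        for state, modality in zip(states, modalities):
            # Route the token to the expert that matches its modality tag.
            w_in, w_out = self.experts[modality]
            activated = [max(0.0, v) for v in matvec(w_in, state)]  # ReLU stand-in
            ffn = matvec(w_out, activated)
            outputs.append([a + b for a, b in zip(state, ffn)])
        return outputs
```

Calling `MultiwayBlock(4, 8, ["vision", "language"]).forward(tokens, tags)` with one tag per token returns one output vector per token. Within a single block, replacing the language expert leaves the outputs of vision tokens unchanged, which is the sense in which the experts hold modality-specific parameters.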

As shown in Figure 3, the unified architecture enables BEiT-3 to support a wide range of downstream tasks. For example, BEiT-3 can be used as an image backbone for various vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. It can also be finetuned as a dual encoder for efficient image-text retrieval, and a fusion model for multimodal understanding and generation tasks.

Pretraining Task: Masked Data Modeling

We pretrain BEiT-3 via a unified masked data modeling objective on monomodal (i.e., images and texts) and multimodal data (i.e., image-text pairs). During pretraining, we randomly mask some percentage of text tokens or image patches and train the model to recover the masked tokens. The unified mask-then-predict task not only learns representations but also learns the alignment of different modalities. Specifically, text data is tokenized by a SentencePiece tokenizer. Image data is tokenized by the tokenizer of BEiT v2 to obtain the discrete visual tokens used as the reconstruction targets. We randomly mask $15\%$ of the tokens of monomodal texts and $50\%$ of the text tokens from image-text pairs. For images, we mask $40\%$ of image patches using a block-wise masking strategy as in BEiT.
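As a rough illustration of the three masking ratios, the sketch below samples mask positions for a text sequence and for an image patch grid. The function names are hypothetical, and the block-wise routine is a simplified stand-in (random rectangles added until the target count is reached, with the last rectangle truncated) rather than the exact procedure used in BEiT. The $16\times 16$ grid in the usage note corresponds to the $224\times 224$ resolution and $14\times 14$ patch size given in the pretraining settings.

```python
import random

# Masking ratios stated in the text.
TEXT_RATIO = 0.15         # monomodal texts
PAIRED_TEXT_RATIO = 0.50  # texts from image-text pairs
IMAGE_RATIO = 0.40        # image patches


def mask_text(num_tokens, ratio, rng):
    """Pick token positions uniformly at random, without replacement."""
    count = max(1, round(num_tokens * ratio))
    return sorted(rng.sample(range(num_tokens), count))


def mask_image_blockwise(grid, ratio, rng):
    """Simplified block-wise masking over a grid x grid patch layout.

    Assumes grid >= 4. Rectangles are drawn at random positions and sizes
    and their patches are added until the target number is masked.
    """
    target = round(grid * grid * ratio)
    masked = set()
    while len(masked) < target:
        height = rng.randint(2, grid // 2)
        width = rng.randint(2, grid // 2)
        top = rng.randint(0, grid - height)
        left = rng.randint(0, grid - width)
        for row in range(top, top + height):
            for col in range(left, left + width):
                if len(masked) < target:
                    masked.add((row, col))
    return masked
```

For example, `mask_image_blockwise(16, IMAGE_RATIO, random.Random(0))` returns 102 patch coordinates out of 256, and `mask_text(100, TEXT_RATIO, random.Random(0))` returns 15 positions.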

We only use one pretraining task, which makes the training process scaling-up friendly. In contrast, previous vision-language models usually employ multiple pretraining tasks, such as image-text contrast, image-text matching, and word-patch/region alignment. We show that a much smaller pretraining batch size can be used with the mask-then-predict task. In comparison, contrastive-based models usually need a very large batch size for pretraining, which brings more engineering challenges, such as GPU memory cost. For example, CoCa uses a $65$k batch size, CLIP uses a $32$k batch size, and Florence uses a $24$k batch size, whereas BEiT-3 uses a much smaller $6$k batch size for pretraining.

Scaling Up: BEiT-3 Pretraining

BEiT-3 is a giant-size foundation model following the setup of ViT-giant. As shown in Table 2, the model consists of a $40$-layer Multiway Transformer with $1408$ hidden size, $6144$ intermediate size, and $16$ attention heads. All layers contain both vision experts and language experts. Vision-language experts are also employed in the top three Multiway Transformer layers. The self-attention module is shared across different modalities. BEiT-3 consists of $1.9$B parameters in total, including $692$M parameters for vision experts, $692$M parameters for language experts, $52$M parameters for vision-language experts, and $317$M parameters for the shared self-attention module. Note that only the vision-related parameters (about 1B, comparable in size to ViT-giant) are activated when the model is used as a vision encoder.
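The per-component parameter counts can be reproduced from the layer dimensions with a short back-of-the-envelope calculation. The sketch assumes that each expert is a two-layer feed-forward network with biases and that self-attention has four square projections (query, key, value, output) with biases; layer normalization, embeddings, and any other parameters are ignored, so it accounts only for the four components itemized in the text.

```python
LAYERS = 40
HIDDEN = 1408
INTERMEDIATE = 6144
FUSION_LAYERS = 3  # top layers that also carry a vision-language expert


def ffn_params(hidden, intermediate):
    # Two linear layers (hidden -> intermediate -> hidden), each with a bias.
    return (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)


def attention_params(hidden):
    # Query, key, value, and output projections, each with a bias.
    return 4 * (hidden * hidden + hidden)


vision_experts = LAYERS * ffn_params(HIDDEN, INTERMEDIATE)
language_experts = LAYERS * ffn_params(HIDDEN, INTERMEDIATE)
vl_experts = FUSION_LAYERS * ffn_params(HIDDEN, INTERMEDIATE)
shared_attention = LAYERS * attention_params(HIDDEN)

# Parameters active when the model is used purely as a vision encoder.
vision_only = vision_experts + shared_attention

for name, value in [
    ("vision experts", vision_experts),
    ("language experts", language_experts),
    ("vision-language experts", vl_experts),
    ("shared self-attention", shared_attention),
    ("vision encoder path", vision_only),
]:
    print(f"{name}: {value / 1e6:.0f}M")
```

Under these assumptions the counts round to 692M, 692M, 52M, and 317M, matching the figures above, and the vision path (vision experts plus shared attention) comes to roughly 1.0B, consistent with the "about 1B" stated for the vision encoder.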

Pretraining Data

BEiT-3 is pretrained on both monomodal and multimodal data shown in Table 3. For multimodal data, there are about $15$M images and $21$M image-text pairs collected from five public datasets: Conceptual 12M (CC12M), Conceptual Captions (CC3M), SBU Captions (SBU), COCO, and Visual Genome (VG). For monomodal data, we use $14$M images from ImageNet-21K and $160$GB text corpora from English Wikipedia, BookCorpus, OpenWebText (http://skylion007.github.io/OpenWebTextCorpus), CC-News, and Stories.

Pretraining Settings

We pretrain BEiT-3 for $1$M steps. Each batch contains $6144$ samples in total, including $2048$ images, $2048$ texts, and $2048$ image-text pairs. The batch size is much smaller than that of contrastive models. BEiT-3 uses $14\times 14$ patch size and is pretrained at resolution $224\times 224$. We use the same image augmentation as in BEiT, including random resized cropping, horizontal flipping, and color jittering. A SentencePiece tokenizer with $64$k vocab size is employed to tokenize the text data. We use the AdamW optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.98$, and $\epsilon=$ 1e-6 for optimization. We use a cosine learning rate decay scheduler with a peak learning rate of 1e-3 and a linear warmup of $10$k steps. The weight decay is $0.05$. Stochastic depth with a rate of $0.1$ is used. The BEiT initialization algorithm is used to stabilize Transformer training: we first randomly initialize the parameters within a small range, e.g., $[-0.02,0.02]$, and then rescale the output matrices (i.e., the last linear projection within each sublayer) of self-attention and FFN in the $l$-th Transformer layer by $\frac{1}{\sqrt{2l}}$.
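The schedule and the initialization rescaling can be written out as two small functions. This is a sketch under stated assumptions: the final learning rate of the cosine decay (`floor`, set to zero here) is not given in the text, the layer index $l$ is taken to be 1-based, and the function names are illustrative.

```python
import math

PEAK_LR = 1e-3
WARMUP_STEPS = 10_000
TOTAL_STEPS = 1_000_000


def learning_rate(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS, floor=0.0):
    """Linear warmup to the peak, then cosine decay down to `floor` (assumed 0)."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


def output_projection_scale(layer_index):
    """Rescaling factor 1 / sqrt(2l) for the l-th layer (1-based, assumed)."""
    return 1.0 / math.sqrt(2 * layer_index)


# Number of patches per image at 224 x 224 resolution with 14 x 14 patches.
patches_per_image = (224 // 14) ** 2
```

With these settings the learning rate rises linearly over the first $10$k steps, reaches 1e-3 at step $10$k, and decays to the assumed floor at step $1$M. The rescaling factor shrinks with depth, so the output projections of deeper layers start with smaller weights, and each image is split into $16\times 16=256$ patches.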

Experiments on Vision and Vision-Language Tasks

We extensively evaluate BEiT-3 on major public benchmarks for both vision-language and vision tasks. Table 1 presents the overview of results. BEiT-3 obtains state-of-the-art performance on a wide range of vision and vision-language tasks.

We evaluate the capabilities of BEiT-3 on the widely used vision-language understanding and generation benchmarks, including visual question answering, visual reasoning, image-text retrieval, and image captioning.

Visual question answering (VQA) requires the model to answer natural language questions about input images. Following previous work, we conduct finetuning experiments on the VQA v2.0 dataset and formulate the task as a classification problem. The model is trained to predict answers from the $3129$ most frequent answer candidates in the training set. BEiT-3 is finetuned as a fusion encoder to model deep interactions of images and questions for the VQA task. We concatenate the embeddings of a given question and an image, and then feed the input embeddings into Multiway Transformers to jointly encode the image-question pair. The final pooled output is fed into a classifier layer to predict the answer. As shown in Table 4, BEiT-3 outperforms all previous models by a large margin (more than $1.7$ points), pushing the state of the art to $84.03$ with a single model.

Visual Reasoning

The task requires models to perform joint reasoning about images and natural language descriptions. We evaluate the model on the popular NLVR2 benchmark, where the goal is to determine whether a textual description is true about a pair of images. Following previous work, we construct two image-text pairs based on the triplet input. We finetune BEiT-3 as a fusion encoder to jointly encode the image-text pairs. The final pooled outputs of the two pairs are concatenated and then fed into a classifier layer to predict the label. As shown in Table 4, BEiT-3 achieves a new state-of-the-art result for visual reasoning, outperforming CoCa by about $5.6$ points. The performance on NLVR2 reaches above $90\%$ for the first time.

Image Captioning

The task aims to generate a natural language caption for the given image. We use the COCO benchmark, and finetune and evaluate the model on the Karpathy split. Following UniLM and s2s-ft, BEiT-3 is used as a conditional generation model via masked finetuning. To be more specific, a special self-attention mask is employed for the image captioning task. Image tokens (i.e., image patches) can only attend to each other bidirectionally within the image sequence. Tokens of the caption can attend to image tokens, their leftward caption tokens, and themselves. During finetuning, we randomly mask some percentage of caption tokens. The model is trained to recover these tokens based on the clues of the image and its leftward caption context. We also mask the special boundary token [SEP] to help the model learn to terminate the generation. For simplicity, BEiT-3 is trained with simple cross-entropy loss, without using CIDEr optimization. During inference, we generate the caption tokens one by one in an autoregressive manner. Table 4 presents the results on COCO captioning. BEiT-3 outperforms all previous models trained with cross-entropy loss, creating a new state-of-the-art image captioning result. The results demonstrate the superiority of BEiT-3 for vision-language generation.
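The attention pattern described for captioning can be made concrete as a boolean matrix. The sketch below assumes the input sequence is laid out as all image tokens followed by all caption tokens; the function name is illustrative, and `mask[q][k]` being `True` means that query position `q` may attend to key position `k`.

```python
def caption_attention_mask(num_image_tokens, num_caption_tokens):
    """Build the captioning self-attention mask as a list of boolean rows.

    Layout assumption: positions [0, num_image_tokens) are image tokens,
    the remaining positions are caption tokens in left-to-right order.
    """
    total = num_image_tokens + num_caption_tokens
    mask = [[False] * total for _ in range(total)]
    for query in range(total):
        for key in range(total):
            if key < num_image_tokens:
                # Image tokens are visible to every position, so image tokens
                # attend to each other bidirectionally and caption tokens
                # see the whole image.
                mask[query][key] = True
            elif query >= num_image_tokens:
                # Caption tokens see leftward caption tokens and themselves.
                mask[query][key] = key <= query
            # Image queries never attend to caption keys (stays False).
    return mask
```

For two image tokens and three caption tokens, the first image row is `[True, True, False, False, False]`, the first caption row is `[True, True, True, False, False]`, and the last caption row is all `True`. The lower-triangular caption block is what allows token-by-token autoregressive decoding at inference time.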

Image-Text Retrieval

The task is to measure the similarity between images and texts. There are two directions depending on the modality of the retrieved target: image-to-text retrieval and text-to-image retrieval. Two popular retrieval benchmarks, i.e., COCO and Flickr30K, are used to evaluate the model. Following previous work, we use the Karpathy split for the two benchmarks. BEiT-3 is finetuned as a dual encoder for efficient image-text retrieval. Dual-encoder models separately encode images and texts to obtain their representations. Then we calculate the cosine similarity scores of these representations. Dual-encoder models are more efficient than fusion-encoder models because they do not have to jointly encode all possible image-text pairs.
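A small sketch makes the efficiency argument concrete. Given precomputed image and text embeddings (here plain lists of floats standing in for encoder outputs), retrieval reduces to ranking by cosine similarity, and the number of encoder forward passes grows with the sum of the collection sizes rather than their product. The function names and the toy vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_texts(image_vec, text_vecs):
    """Return text indices ordered from most to least similar to the image."""
    scores = [cosine(image_vec, t) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda i: scores[i], reverse=True)


def encoder_passes(num_images, num_texts):
    """Forward passes needed to score every image against every text."""
    return {
        "dual_encoder": num_images + num_texts,    # encode each item once
        "fusion_encoder": num_images * num_texts,  # encode each pair jointly
    }
```

For a hypothetical collection of 1,000 images and 5,000 texts, `encoder_passes(1000, 5000)` gives 6,000 passes for the dual encoder against 5,000,000 for a fusion encoder; the dual encoder additionally needs the inexpensive pairwise similarity computation.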

We directly finetune BEiT-3 on COCO and Flickr30K, although the model is not pretrained with an image-text contrastive loss. Surprisingly, BEiT-3 outperforms previous state-of-the-art models using only a small amount of contrastive training. The results demonstrate that BEiT-3 effectively learns alignments between images and texts via masked data modeling. To further improve performance, we perform intermediate finetuning with an image-text contrastive objective on the pretraining image-text pairs. We finetune the model for far fewer steps than in pretraining. Then we use the model to evaluate zero-shot and finetuned image-text retrieval. The finetuned results are presented in Table 5. Dual-encoder BEiT-3 outperforms prior models by a large margin, achieving $3.0$/$4.0$ absolute improvement on COCO top-$1$ image-to-text/text-to-image retrieval, and $0.8$/$2.4$ absolute improvement on Flickr30K top-$1$ image-to-text/text-to-image retrieval. BEiT-3 also significantly outperforms fusion-encoder-based models, which require more computation for inference. As presented in Table 6, BEiT-3 also achieves better performance than previous models on Flickr30K zero-shot retrieval.

Vision Downstream Tasks

In addition to vision-language downstream tasks, BEiT-3 can be transferred to a wide range of vision downstream tasks, including object detection, instance segmentation, semantic segmentation, and image classification. The number of effective parameters is comparable to ViT-giant, i.e., about 1B, when BEiT-3 is used as a vision encoder.

We conduct finetuning experiments on the COCO 2017 benchmark, which consists of $118$k training, $5$k validation, and $20$k test-dev images. We use BEiT-3 as the backbone and follow ViTDet, including a simple feature pyramid and window attention, for the object detection and instance segmentation tasks. Following common practices, we first conduct intermediate finetuning on the Objects365 dataset. Then we finetune the model on the COCO dataset. Soft-NMS is used during inference. Table 7 compares BEiT-3 with previous state-of-the-art models on COCO object detection and instance segmentation. BEiT-3 achieves the best results on the COCO test-dev set with a smaller image size used for finetuning, reaching up to $63.7$ box AP and $54.8$ mask AP.

Semantic Segmentation

Semantic segmentation aims to predict the label for each pixel of the given image. We evaluate BEiT-3 on the challenging ADE20K dataset, which includes $150$ semantic categories. ADE20K contains $20$k images for training and $2$k images for validation. We directly follow the task transfer settings of ViT-Adapter. We use a dense prediction task adapter and employ Mask2Former as the segmentation framework. As shown in Table 8, BEiT-3 creates a new state-of-the-art result with $62.8$ mIoU, outperforming the FD-SwinV2 giant model with 3B parameters by $1.4$ points. This shows that BEiT-3 achieves superior performance on the dense prediction task.

Image Classification

We evaluate the model on ImageNet-1K, which contains $1.28$M training images and $50$k validation images in $1$k classes. Rather than appending a task layer to the vision encoder, we formulate the task as an image-to-text retrieval task. We use the category names as texts to construct image-text pairs. BEiT-3 is trained as a dual encoder to find the most relevant label for an image. During inference, we first compute the feature embeddings of possible class names and the feature embedding of the image. Their cosine similarity scores are then calculated to predict the most probable label for each image. Table 9 reports the results on ImageNet-1K. We first perform intermediate finetuning on ImageNet-21K, then we train the model on ImageNet-1K. For a fair comparison, we compare with the previous models only using public image-tag data. BEiT-3 outperforms prior models, creating a new state-of-the-art result when only using public image-tag data.

Conclusion

In this paper, we present BEiT-3, a general-purpose multimodal foundation model, which achieves state-of-the-art performance across a wide range of vision and vision-language benchmarks. The key idea of BEiT-3 is that images can be modeled as a foreign language, so that we can conduct masked “language” modeling over images, texts, and image-text pairs in a unified way. We also demonstrate that Multiway Transformers can effectively model different vision and vision-language tasks, making them an intriguing option for general-purpose modeling. BEiT-3 is simple and effective, and is a promising direction for scaling up multimodal foundation models. For future work, we are working on pretraining multilingual BEiT-3 and including more modalities (e.g., audio) in BEiT-3 to facilitate cross-lingual and cross-modality transfer, and to advance the big convergence of large-scale pretraining across tasks, languages, and modalities. We are also interested in enabling in-context learning capability for multimodal foundation models by combining the strengths of BEiT-3 and MetaLM.

Appendix A Effects of Intermediate Finetuning for Retrieval

As shown in Table 10, we directly finetune BEiT-3 on COCO and Flickr30K. BEiT-3 still outperforms previous state-of-the-art models, even without using an image-text contrastive objective during pretraining. The results demonstrate the effectiveness of masked data modeling for learning cross-modal representations. Next, we perform intermediate finetuning on the pretraining image-text pairs for $5$ epochs with a $16$k batch size. The peak learning rate is 3e-5, with linear warmup over the first epoch. The image input size is $224\times 224$. The weight decay is set to $0.05$. We disable dropout as in pretraining and use drop path with a rate of $0.3$. The layer-wise learning rate decay is $0.95$. We use the AdamW optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.999$.
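To illustrate what a layer-wise learning rate decay of $0.95$ implies across a $40$-layer model, the sketch below assigns each Transformer layer its own learning rate. The exact convention is not specified in the text; the sketch assumes the common one in which the top layer trains at the peak rate and each layer below it is scaled by one further factor of the decay. Embeddings and the task head are left out, and the function name is illustrative.

```python
PEAK_LR = 3e-5
LAYER_DECAY = 0.95
NUM_LAYERS = 40


def layerwise_learning_rates(peak_lr, decay, num_layers):
    """Learning rate per layer, ordered from the bottom layer to the top layer.

    Assumed convention: layer l (1-based) gets peak_lr * decay ** (num_layers - l),
    so the top layer uses the peak rate and lower layers use smaller rates.
    """
    return [peak_lr * decay ** (num_layers - layer) for layer in range(1, num_layers + 1)]


rates = layerwise_learning_rates(PEAK_LR, LAYER_DECAY, NUM_LAYERS)
```

Under this convention the lowest layers, which carry the most general pretrained features, are updated with a substantially smaller step size than the top layers, which is the usual motivation for the technique.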