OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang

Introduction

Building an omnipotent model that handles as many tasks and modalities as human beings is an attractive goal in the AI community. The possibilities of achieving this goal may largely depend on whether massive varieties of modalities, tasks and training regimes can be represented with only a few forms that can be unified and managed by a single model or system.

Recent developments of the Transformer architecture have shown its potential as a universal computation engine. In supervised learning, the “pretrain-finetune” paradigm has achieved excellent results in many domains. In few-/zero-shot learning, language models with prompt / instruction tuning have proven to be powerful zero-/few-shot learners. These advances provide greater opportunities than ever for the emergence of an omni-model.

To support better generalization for open-ended problems while maintaining multitask performance and ease of use, we advocate that an omnipotent model should have the following three properties: 1. Task-Agnostic (TA): unified task representation to support different types of tasks, including classification, generation, self-supervised pretext tasks, etc., and to be agnostic to either pretraining or finetuning. 2. Modality-Agnostic (MA): unified input and output representation shared among all tasks to handle different modalities. 3. Task Comprehensiveness (TC): enough task variety to accumulate generalization ability robustly.

However, it is challenging to satisfy these properties while maintaining superior performance in downstream tasks. Current language and multimodal pretrained models readily fail at parts of these properties, due to the following design choices. 1. Extra learnable components for finetuning, e.g., task-specific heads, adapters, and soft prompts. This makes the model structure task-specific and creates a discrepancy between pretraining and finetuning. Such designs are also ill-suited to supporting unseen tasks in a zero-shot manner. 2. Task-specific formulation. For most current methods, pretraining, finetuning and zero-shot tasks usually differ in task formulation and training objectives. This violates TA and makes it burdensome to scale up the task population to achieve TC. 3. Entangling modality representation with downstream tasks. It is a common practice for Vision-Language models to take the detected objects as part of the image input features. Though this yields better downstream task performance on some closed-domain datasets, it depends on an extra object detector, which usually fails on open-domain data.

Therefore, we explore an omni-model for multimodal pretraining and propose OFA, hopefully “One For All”, which achieves the objectives of unifying architectures, tasks, and modalities, and supports the three properties above. (This work is the latest in our M6 series.) We formulate both pretraining and finetuning tasks in a unified sequence-to-sequence abstraction via handcrafted instructions to achieve the Task-Agnostic property. A Transformer is adopted as the Modality-Agnostic compute engine, with the constraint that no learnable task- or modality-specific components are added for downstream tasks. Information from different modalities can be represented within a globally shared multimodal vocabulary across all tasks. We then support Task Comprehensiveness by pretraining on a variety of uni-modal and cross-modal tasks.

We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA is the first attempt to unify the following vision & language, vision-only and language-only tasks, including understanding and generation, e.g., text-to-image generation, visual grounding, visual question answering (VQA), image captioning, image classification, language modeling, etc., via a simple sequence-to-sequence learning framework with a unified instruction-based task representation.

OFA is pretrained on publicly available datasets of $20$M image-text pairs, in comparison with recent models that rely on paired data of a much larger scale. OFA achieves state-of-the-art performance in a series of vision & language downstream tasks, including image captioning, visual question answering, visual entailment, referring expression comprehension, etc.

OFA, as a multimodal pretrained model, achieves comparable performances on unimodal tasks with SOTA pretrained models in language or vision, e.g., RoBERTa, ELECTRA and DeBERTa for natural language understanding, UniLM, Pegasus and ProphetNet for natural language generation, and MoCo-v3, BEiT and MAE for image classification.

We verify that OFA achieves competitive performance in zero-shot learning. Also, it can transfer to unseen tasks with new task instructions and adapt to out-of-domain information without finetuning.

Related Work

Natural language pretraining has revolutionized the whole NLP research community. Representative of this track is the birth of BERT and GPT. A number of studies have progressively advanced pretraining by improving pretraining tasks and designing more sophisticated model architectures. Having witnessed the success of natural language pretraining, researchers have promoted self-supervised learning (SSL) in computer vision. Recently, mirroring masked language modeling (MLM) in language pretraining, generative pretraining with the ViT architecture has further boosted downstream performance.

Multimodal Pretraining

Multimodal pretraining has been developing rapidly. Researchers have applied masking strategies and the encoder-decoder architecture to adapt models to generation tasks. Besides, to simplify preprocessing, patch projection has received attention and helped Transformers achieve SOTA performance in downstream tasks. To make full use of large-scale weakly supervised data, prior work trains a bi-encoder on $400$ million pairs and demonstrates excellent performance in retrieval tasks. Another line of work is text-to-image synthesis. A number of works combine Transformers with VQVAE or VQGAN to generate high-quality, high-resolution images. However, the previously mentioned methods are limited either to processing a single type of data, such as cross-modal data only, or in their capabilities. Also, the discrepancy between pretraining and finetuning behaviors limits the transferability to open-ended data.

Unified Frameworks

In pursuit of unified models, prior work demonstrates a uniform format to represent tasks. In NLP, recent studies unify diverse tasks covering natural language understanding and generation as text-to-text transfer or language modeling. Following this idea, several works demonstrate text-generation-based multimodal pretrained models. Others propose a simple framework that can process information from multiple modalities with a uniform byte-sequence representation, unify tasks of different modalities by designing various task-specific layers, or explore a retrieval-based unified paradigm. However, these multimodal pretrained models suffer from performance degradation in downstream tasks, e.g., VQA, image captioning, etc., and they have no image generation capability.

OFA

In this work, we propose OFA, a unified Seq2Seq framework for the unification of I/O & architectures, tasks, and modalities. The overall framework is illustrated in Figure 2.

To process different modalities without a task-specific output schema, it is essential to represent data of various modalities in a unified space. A possible solution is to discretize text, image, and object and represent them with tokens in a unified vocabulary. Recent advances in image quantization have demonstrated effectiveness in text-to-image synthesis, and thus we utilize this strategy for the target-side image representations. Sparse coding is effective in reducing the sequence length of image representation. For example, an image of the resolution of $256\times 256$ is represented as a code sequence of the length of $16\times 16$. Each discrete code strongly correlates with the corresponding patch.

Apart from representing images, it is also essential to represent objects within images, as there are a series of region-related tasks. Following prior work, we represent objects as a sequence of discrete tokens. To be more specific, for each object, we extract its label and its bounding box. The continuous corner coordinates (the top left and the bottom right) of the bounding box are uniformly discretized to integers as location tokens $\langle x_{1},y_{1},x_{2},y_{2}\rangle$. As to the object labels, they are intrinsically words and thus can be represented with BPE tokens.

Finally, we use a unified vocabulary for all the linguistic and visual tokens, including subwords, image codes, and location tokens.
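The discretization of bounding boxes and the shared vocabulary described above can be illustrated with a minimal sketch. The number of location bins, the BPE vocabulary size, the image codebook size, and the layout of the three id ranges are all assumptions made for illustration; this section does not state them.

```python
# Minimal sketch of a unified vocabulary with location tokens.
# All sizes below are hypothetical and chosen only for illustration.
NUM_BINS = 1000           # number of uniform location bins (assumption)
TEXT_VOCAB_SIZE = 50000   # number of BPE subword ids (assumption)
IMAGE_CODE_SIZE = 8192    # number of discrete image codes (assumption)

# Assumed layout: [subwords | image codes | location tokens].
IMAGE_CODE_OFFSET = TEXT_VOCAB_SIZE
LOCATION_OFFSET = TEXT_VOCAB_SIZE + IMAGE_CODE_SIZE


def quantize_coord(value, extent, num_bins=NUM_BINS):
    """Map a continuous coordinate in [0, extent] to a bin in [0, num_bins - 1]."""
    ratio = min(max(value / extent, 0.0), 1.0)
    return min(int(ratio * num_bins), num_bins - 1)


def box_to_location_tokens(box, width, height):
    """Turn (x1, y1, x2, y2) in pixels into four ids in the unified vocabulary."""
    x1, y1, x2, y2 = box
    bins = [
        quantize_coord(x1, width),
        quantize_coord(y1, height),
        quantize_coord(x2, width),
        quantize_coord(y2, height),
    ]
    return [LOCATION_OFFSET + b for b in bins]


def image_code_to_token(code):
    """Shift a discrete image code into its slot of the unified vocabulary."""
    return IMAGE_CODE_OFFSET + code


tokens = box_to_location_tokens((64, 128, 192, 256), width=256, height=256)
```

With this layout, a subword id, an image code, and a location bin never collide, so a single output softmax can cover all three kinds of tokens.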

Architecture

Following previous successful practices in multimodal pretraining, we choose the Transformer as the backbone architecture, and we adopt the encoder-decoder framework as the unified architecture for all pretraining, finetuning, and zero-shot tasks. Specifically, both the encoder and the decoder are stacks of Transformer layers. A Transformer encoder layer consists of a self-attention and a feed-forward network (FFN), while a Transformer decoder layer consists of a self-attention, an FFN and a cross-attention for building the connection between the decoder and the encoder output representations. To stabilize training and accelerate convergence, we add head scaling to self-attention, a post-attention layer normalization (LN), and an LN following the first layer of the FFN. For positional information, we use two absolute position embeddings for text and images, respectively. Instead of simply adding the position embeddings, we decouple the position correlation from token embeddings and patch embeddings. In addition, we also use 1D relative position bias for text and 2D relative position bias for images.

Tasks & Modalities

A unified framework is designed to provide architecture compatibility across different modalities and downstream tasks so that opportunities can arise to generalize to unseen tasks within the same model. Then we have to represent the possible downstream tasks concerning different modalities in a unified paradigm. Therefore, an essential point for the design of pretraining tasks is the consideration of multitask and multimodality.

To unify tasks and modalities, we design a unified sequence-to-sequence learning paradigm for pretraining, finetuning, and inference on all tasks concerning different modalities. Both pretraining tasks and downstream tasks of cross-modal and uni-modal understanding and generation are all formed as Seq2Seq generation. This makes it possible to perform multitask pretraining on multimodal and uni-modal data to endow the model with comprehensive capabilities. Specifically, we share an identical schema across all tasks, while we specify handcrafted instructions to discriminate between them.

For cross-modal representation learning, we design $5$ tasks, including visual grounding (VG), grounded captioning (GC), image-text matching (ITM), image captioning (IC), and visual question answering (VQA). For VG, the model learns to generate location tokens specifying the region position $\langle x_{1},y_{1},x_{2},y_{2}\rangle$ based on the input of the image $x^{i}$ and the instruction “Which region does the text $x^{t}$ describe?” where $x^{t}$ refers to the region caption. GC is an inverse task of VG. The model learns to generate a description based on the input image $x^{i}$ and the instruction “What does the region describe? region: $\langle x_{1},y_{1},x_{2},y_{2}\rangle$ ”. For ITM, we use each original image-text pair as the positive sample and construct a new one as the negative by pairing the image with a randomly substituted caption. The model learns to discriminate whether the given image and text are paired by learning to generate “Yes” or “No” based on the input image $x^{i}$ and the instruction “Does the image describe $x^{t}$ ?”. As to image captioning, this task can naturally adapt to the sequence-to-sequence format. The model learns to generate the caption based on the given image and the instruction “What does the image describe?”. For VQA, we send the image and the question as the input and require the model to learn to generate correct answers.
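The five cross-modal tasks can be viewed as functions that each return an (instruction, target) pair of strings, with the image supplied separately to the encoder. The sketch below uses the instruction wording quoted above; the function names, the `<bin_k>` spelling of location tokens, and the way the negative caption is sampled are illustrative assumptions.

```python
import random


def vg_example(region_caption, loc_tokens):
    """Visual grounding: region caption in, location tokens out."""
    return f"Which region does the text {region_caption} describe?", " ".join(loc_tokens)


def gc_example(loc_tokens, region_caption):
    """Grounded captioning: the inverse of visual grounding."""
    return f"What does the region describe? region: {' '.join(loc_tokens)}", region_caption


def itm_examples(caption, other_captions, rng):
    """Image-text matching: one positive pair and one randomly substituted negative."""
    positive = (f"Does the image describe {caption}?", "Yes")
    negative_caption = rng.choice([c for c in other_captions if c != caption])
    negative = (f"Does the image describe {negative_caption}?", "No")
    return [positive, negative]


def caption_example(caption):
    """Image captioning: fixed instruction, caption as target."""
    return "What does the image describe?", caption


def vqa_example(question, answer):
    """VQA: the question itself serves as the instruction."""
    return question, answer


# Hypothetical location-token strings; the real token spelling is not given here.
locs = ["<bin_10>", "<bin_20>", "<bin_300>", "<bin_400>"]
vg = vg_example("a man riding a horse", locs)
itm = itm_examples("a man riding a horse", ["a red bus"], random.Random(0))
```

Because every task reduces to the same pair of strings, a new task only needs a new instruction template rather than a new output head.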

For uni-modal representation learning, we design $2$ tasks for vision and $1$ task for language, respectively. The model is pretrained with image infilling and object detection for vision representation learning. Recent advances in generative self-supervised learning for computer vision show that masked image modeling is an effective pretraining task. In practice, we mask the middle part of the images as the input. The model learns to generate the sparse codes for the central part of the image based on the corrupted input and the specified instruction “What is the image in the middle part?”. We additionally add object detection to pretraining, following prior work. The model learns to generate human-annotated object representations, i.e., the sequence of object position and label, based on the input image and the text “What are the objects in the image?” as the instruction. Both tasks strengthen representation learning at both the pixel and object levels. For language representation learning, following prior practice, we pretrain the unified model on plain text data with text infilling.
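As a rough illustration of the image infilling setup, the sketch below builds a boolean mask over a grid of positions in which the central block is hidden. The fraction of the image that counts as the "middle part" is not specified in this section, so the value of one half per side used here is an assumption, as is the helper name.

```python
def center_mask(grid_h, grid_w, frac=0.5):
    """Return a grid of booleans; True marks positions in the masked middle block.

    `frac` is the masked fraction of each side (assumed value, not from the paper).
    """
    mask_h = int(grid_h * frac)
    mask_w = int(grid_w * frac)
    top = (grid_h - mask_h) // 2
    left = (grid_w - mask_w) // 2
    return [
        [top <= r < top + mask_h and left <= c < left + mask_w for c in range(grid_w)]
        for r in range(grid_h)
    ]


# A 16 x 16 grid, matching the code-sequence example for a 256 x 256 image.
mask = center_mask(16, 16)
num_masked = sum(sum(row) for row in mask)  # positions the model must generate
```

Under these assumptions the decoder target is the sequence of codes at the masked positions, read in row-major order.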

In this way, we unify multiple modalities and multiple tasks to a single model and pretraining paradigm. OFA is pretrained jointly with those tasks and data. Thus, it can perform different tasks concerning natural language, vision, and cross-modality.

Pretraining Datasets

We construct pretraining datasets by incorporating Vision & Language data (i.e., image-text pairs), Vision data (i.e., raw image data, object-labeled data), and Language data (i.e., plain texts). For replication, we only use datasets that are publicly available. We carefully filter our pretraining data and exclude images that appear in the validation and test sets of downstream tasks to avoid data leakage. We provide more details about pretraining datasets in Appendix A.1.

Training & Inference

We optimize the model with the cross-entropy loss. Given an input $x$, an instruction $s$ and an output $y$, we train OFA by minimizing $\mathcal{L}=-\sum_{i=1}^{|y|}\log P_{\theta}(y_{i}|y_{<i},x,s)$, where $\theta$ refers to the model parameters. For inference, we apply decoding strategies, e.g., beam search, to enhance the quality of generation. However, this paradigm has several problems in classification tasks: 1. optimizing on the entire vocabulary is unnecessary and inefficient; 2. the model may generate invalid labels outside the closed label set during inference. To overcome these issues, we introduce a search strategy based on a prefix tree (Trie). Experimental results show that the Trie-based search can enhance the performance of OFA on classification tasks. See Appendix B for more details.
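To make the idea of a prefix-tree search concrete, the following is a minimal sketch of decoding restricted to a closed label set. It is not the procedure from Appendix B: it uses greedy decoding instead of beam search, and `score_fn`, the `<eos>` marker, and the toy scores are hypothetical stand-ins for the model's next-token scores.

```python
END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(label_token_seqs):
    """Build a nested-dict prefix tree from tokenized candidate labels."""
    root = {}
    for seq in label_token_seqs:
        node = root
        for token in seq:
            node = node.setdefault(token, {})
        node[END] = {}  # mark that a complete label ends here
    return root


def allowed_next_tokens(trie, prefix):
    """Tokens that keep `prefix` inside the label set (sorted for determinism)."""
    node = trie
    for token in prefix:
        node = node[token]
    return sorted(node)


def constrained_greedy_decode(trie, score_fn, max_len=16):
    """Pick the best-scoring token at each step among those the trie allows."""
    prefix = []
    for _ in range(max_len):
        candidates = allowed_next_tokens(trie, prefix)
        best = max(candidates, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            break
        prefix.append(best)
    return prefix


labels = [["yes"], ["no"], ["not", "sure"]]
trie = build_trie(labels)
# Toy scores: "maybe" scores highest overall but is not a valid label.
toy_scores = {"maybe": 0.9, "sure": 0.8, "not": 0.5, "no": 0.3, "<eos>": 0.2, "yes": 0.1}
decoded = constrained_greedy_decode(trie, lambda prefix, tok: toy_scores[tok])
```

The out-of-set token is never considered, which addresses the second problem listed above, and only a handful of candidates are scored per step, which relates to the first.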

Scaling Models

In order to investigate how OFA of different model sizes perform in downstream tasks, we have developed $5$ versions of OFA models, scaling from $33$ M to $940$ M parameters, and we list their detailed hyperparameters in Table 1.

To be more specific, we have built basic models of $\rm Base$ and $\rm Large$ sizes, $\text{OFA}\rm_{Base}$ and $\text{OFA}\rm_{Large}$. As our network configuration is similar to BART, their sizes are similar to those of $\text{BART}\rm_{Base}$ and $\text{BART}\rm_{Large}$. Additionally, we have developed OFA of a larger size, which we name $\text{OFA}\rm_{Huge}$, or simply OFA when no size is mentioned in the tables. Its size is comparable to that of $\text{SimVLM}\rm_{Huge}$ or $\text{ViT}\rm_{Huge}$. To investigate whether smaller OFA can still reach satisfactory performance, we have developed $\text{OFA}\rm_{Medium}$ and $\text{OFA}\rm_{Tiny}$, which are only around half and less than $20\%$ as large as $\text{OFA}\rm_{Base}$, respectively.

Experiments

This section provides experimental details and analyses to demonstrate our model’s effectiveness. See Appendix A for implementation details.

We evaluate our models on different cross-modal downstream tasks, covering cross-modal understanding and generation. Specifically, we implement experiments on multimodal understanding datasets including VQAv2 for visual question answering and SNLI-VE for visual entailment, and multimodal generation including MSCOCO Image Caption for image captioning, RefCOCO / RefCOCO+ / RefCOCOg for referring expression comprehension as this task can be viewed as bounding box generation, and MSCOCO Image Caption for text-to-image generation. More details are provided in Appendix A.3.

Table 2 presents the performance of OFA and baseline models on VQA and SNLI-VE. In general, OFA achieves the best performance in both tasks with $82.0$ on the VQA test-std set and $91.2$ on the SNLI-VE test set. For smaller-size models, $\text{OFA}\rm_{Large}$ can outperform the recent SOTAs, e.g., VLMo and SimVLM, and $\text{OFA}\rm_{Base}$ can beat the SOTAs before the aforementioned two models in both tasks. This demonstrates that OFA can achieve superior performance on cross-modal understanding tasks and scaling up OFA can bring significant improvements, reflecting the strong potential of large-scale pretrained models.

Table 3 presents the performance of OFA and baseline models on the MSCOCO image captioning dataset. We report the results on the Karpathy test split, and we demonstrate the performance of models trained with Cross-Entropy optimization and additionally with CIDEr optimization based on reinforcement learning. In comparison with the previous SOTA $\text{SimVLM}\rm_{Huge}$ for Cross-Entropy optimization, OFA outperforms it by around $2$ points in CIDEr evaluation. For CIDEr optimization, OFA of the $3$ sizes all outperform the huge-size LEMON, and OFA demonstrates a new SOTA of $154.9$ CIDEr score. By May 31 2022, the single-model OFA had topped the MSCOCO Image Caption Leaderboard (https://competitions.codalab.org/competitions/3221#results).

To evaluate the capability of visual grounding, we conduct experiments on RefCOCO, RefCOCO+, and RefCOCOg. As we unify locations into the vocabulary, visual grounding can be viewed as a sequence generation task. As there is only one target for each query, we limit the generation length to $4$ in order to generate a bounding box $\langle x_{1},y_{1},x_{2},y_{2}\rangle$. Experimental results in Table 4 show that OFA reaches the SOTA performance on the $3$ datasets. Compared with the previous SOTA UNICORN, OFA achieves significant improvement with a gain of $3.61$, $6.65$ and $4.85$ points on the testA sets of RefCOCO and RefCOCO+ as well as the test-u set of RefCOCOg.

Text-to-image generation is a challenging task even for pretrained models. As we pretrain OFA with the task “image-infilling”, i.e., recovering masked patches by generating the corresponding codes, OFA is able to generate codes. We thus directly finetune OFA on the MSCOCO Image Caption dataset for text-to-code generation. At the inference stage, we additionally transform the generated codes to an image with the code decoder. Specifically, we use the codes from VQGAN, following prior work. Experimental results show that OFA outperforms the baselines in all the metrics. Note that increasing the sampling size during inference is expected to bring clear improvements on FID and IS. Compared with DALLE, CogView and NÜWA, whose sampling sizes are $512$, $60$ and $60$, respectively, OFA outperforms these SOTA methods on FID and IS with a much smaller sampling size of $24$. This illustrates that OFA has learned better correspondence among the query text, the image and the image codes.

We compare OFA with CogView and GLIDE on generation quality with normal and counterfactual queries (for more implementation details, please refer to Appendix A.3). Normal queries describe existing things in the real world, while counterfactual queries refer to those describing things that could only exist in our imagination. For normal queries, both CogView and OFA generate images semantically consistent with the given texts, in comparison with GLIDE. The generated examples from our model can provide more sophisticated details of objects, say the horse and the double-decker bus. For counterfactual queries, we find that OFA is the only one that can generate the three imaginary scenes, which indicates its imaginative power based on its strong capability to align text to the image. See Appendix C for more qualitative examples.

Results on Uni-modal Tasks

As the design of OFA unifies different modalities, we evaluate its performance on unimodal tasks, namely tasks of natural language and computer vision. For natural language tasks, we evaluate OFA on $6$ tasks of the GLUE benchmark for natural language understanding and Gigaword abstractive summarization for natural language generation. For computer vision, we evaluate OFA on the classic ImageNet-1K dataset for image classification. More details are provided in Appendix A.3.

As OFA has been pretrained on plain text data, it can be directly transferred to natural language downstream tasks. Natural language generation is essentially a sequence-to-sequence generation task, and for natural language understanding, typically text classification, we regard the tasks as generation tasks where labels are essentially word sequences. Additionally, for each task, we design a manual instruction to indicate to the model what types of questions it should answer. We list our instruction design in Appendix A.3.

We demonstrate that even a unified multimodal pretrained model can achieve highly competitive performance in natural language tasks. Specifically, in the evaluation of natural language understanding, OFA surpasses multimodal pretrained models by large margins in all tasks. In comparison with the state-of-the-art natural language pretrained models, including RoBERTa, XLNET, ELECTRA, and DeBERTa, OFA reaches a comparable performance. In the evaluation of natural language generation, OFA even reaches a new state-of-the-art performance on the Gigaword dataset.

Also, OFA can reach a competitive performance in image classification. Table 8 shows the performance of OFA on image classification. $\text{OFA}\rm_{Large}$ achieves higher accuracy than previous backbone models such as EfficientNet-B7 and ViT-L. We also compare OFA with self-supervised pretraining models based on contrastive learning and masked image modeling. OFA outperforms contrastive-based models such as SimCLR and MoCo-v3 with a similar number of parameters. Compared with pretrained models based on masked image modeling, e.g., BEiT-L and MAE-L, OFA can achieve similar performance.

These aforementioned results in both natural language and vision tasks indicate that a unified multimodal pretrained model is not only effective in multimodal tasks but also capable of tackling unimodal tasks, and in the future, it might be sufficient for such a model to solve complex tasks concerning different modality combinations.

Zero-shot Learning & Task Transfer

The instruction-guided pretraining enables OFA to perform zero-shot inference. Following Uni-Perceiver, we evaluate our model on the $6$ tasks of the GLUE benchmark, including single-sentence classification and sentence-pair classification. Table 9 demonstrates that OFA generally outperforms Uni-Perceiver. However, neither model achieves satisfactory performance in sentence-pair classification (with $\text{Acc}.<60\%$). We hypothesize that the absence of sentence-pair data in the pretraining dataset accounts for this performance.

Also, we find that the model performance is highly sensitive to the design of instructions. To obtain the best result, one should search for a proper instruction template, possibly from a large pool of candidates. A slight change to manual prompts or model parameters may drastically influence the model performance, which indicates a lack of robustness. We leave this issue to future work.

We observe that the model can transfer well to unseen tasks with new task instructions. We design a new task called grounded question answering and present examples in Figure 4. In this scenario, given a question about a certain region of the image, the model should provide a correct answer. We find that the model can achieve a satisfactory performance in this new task, which reflects its strong transferability. Besides, OFA can solve tasks with out-of-domain input data. For example, OFA without finetuning achieves satisfactory performance in VQA for out-of-domain images. Examples are demonstrated in Figure 5. OFA can also perform accurate visual grounding on out-of-domain images, e.g., anime pictures, synthetic images, etc., and we demonstrate more examples in Figure 11 in Appendix C.

Ablation on Multitask Pretraining

Thanks to the unified framework, OFA has been pretrained on multiple tasks and thus endowed with comprehensive capabilities. However, the effects of each task are still undiscovered. We verify their effects on multiple downstream tasks, including image captioning, VQA, image classification, and text-to-image generation.

We first evaluate how uni-modal pretraining tasks influence the performance in both cross-modal and uni-modal tasks. Table 10 demonstrates our experimental results. We observe some interesting phenomena about the effects of uni-modal pretraining tasks. Text infilling brings improvement on image captioning ($+0.8$ CIDEr) and VQA ($+0.46$ Acc.). Natural language pretraining improves the contextualized representation of language and thus enhances performance in cross-modal tasks. However, we notice that the language pretraining task may degrade the performance in image classification, leading to a decrease on ImageNet-1K ($-1.0$ Acc.). Also, it is interesting to find that it does not bring improvement in text-to-image generation ($-0.1$ CLIPSIM). This may be attributed to the simplicity of the text in this task, which suggests that an improved representation of language does not affect the performance. As to image infilling, it significantly improves the performance in image classification ($+1.0$ Acc.) and text-to-image generation ($+0.6$ CLIPSIM). Learning to recover images is an effective self-supervised task for image representation, and it also strengthens the decoder’s ability to generate image codes. However, it hurts the performance in image captioning and VQA. Both tasks require a strong capability in generating texts, and the decoder’s learning of image generation naturally brings performance degradation in captioning ($-0.7$ CIDEr) and VQA ($-0.3$ Acc.).

Furthermore, we evaluate how multimodal tasks impact the performance. Previous studies have provided evidence of the contribution of conventional pretraining tasks, e.g., MLM, MOC, ITM, VQA, image captioning, etc. However, they miss other tasks, e.g., detection and visual grounding & grounded captioning. We conduct experiments on these tasks and find that tasks predicting regions are crucial to multimodal tasks, with a performance increase in image captioning ($+2.3$ CIDEr & $+1.4$ CIDEr) and VQA ($+0.6$ Acc. & $+0.5$ Acc.). This suggests that detection and visual grounding & grounded captioning help the model grasp fine-grained alignments between vision and language. Region information contributes little to text-to-image generation ($+0.1$ CLIPSIM & $+0.1$ CLIPSIM), as this task requires far less text-region alignment information. Surprisingly, we find that detection can improve the performance in visual understanding ($+0.8$ Acc.). This indicates that incorporating region information might be essential to visual understanding, especially on images with complex objects.

Conclusion

In this work, we propose OFA, a Task-Agnostic and Modality-Agnostic framework supporting Task Comprehensiveness. OFA achieves the unification in architecture, tasks and modalities, and thus is capable of multimodal & uni-modal understanding and generation, without specification in additional layers or tasks. Our experiments show that OFA creates new SOTAs in a series of tasks, including image captioning, VQA, visual entailment, and referring expression comprehension. OFA also demonstrates a comparable performance with language / vision pretrained SOTA models in uni-modal understanding and generation tasks, e.g., GLUE, abstractive summarization, and image classification. We provide a further analysis to demonstrate its capability in zero-shot learning and domain & task transfer, and we also verify the effectiveness of pretraining tasks.

In the future, we will continue exploring the issues discovered in this work. Also, we endeavor to figure out a reasonable solution to building an omni-model essentially generalizable to the complex real world.

Acknowledgments

We would like to thank Jie Zhang, Yong Li, Jiamang Wang, Shao Yuan, and Zheng Cao for their support to this project, and we would like to thank Guangxiang Zhao and Fei Sun for their insightful comments to our paper.

References

Appendix A Implementation Details

We construct pretraining datasets by incorporating Vision & Language data (i.e., image-text pairs), Vision data (i.e., raw image data, object-labeled data), and Language data (i.e., plain texts). For replication, the pretraining datasets are publicly available. We carefully filter our pretraining data and exclude images that appear in the validation and test sets of downstream tasks to avoid data leakage. The statistics on the pretraining datasets are listed in Table 11.

For vision & language pretraining, we mainly apply image-text pairs, including image-caption pairs, image-QA pairs, and image-region pairs, as the pretraining data. For the pretraining tasks of image captioning and image-text matching, we collect Conceptual Caption 12M (CC12M), Conceptual Captions (CC3M), SBU, MSCOCO image captions (COCO), and Visual Genome Captions (VG Captions). Specifically, the part of the data from VG requires some additional processing. As texts in VG captions describe local regions on the images, we retrieve regions with area larger than $16,384$ pixels and construct region-caption pairs. For visual question answering, we collect VQAv2, VG-QA, as well as GQA. VQAv2 is a visual question answering dataset with real-world photographs from COCO. VG-QA is also a visual question answering dataset with real-world photographs from VG. The questions of VG-QA are related to specific regions on the images. GQA is a large VQA dataset featuring compositional questions. The images of GQA are also collected from VG. For visual grounding and grounded captioning, we collect data from RefCOCO, RefCOCO+, RefCOCOg and VG Captions. Additional processing is applied to VG Captions for this task. Specifically, we use the data of VG that contains regions with area smaller than $16,384$ pixels for visual grounding, in order to encourage the model to grasp fine-grained alignments between vision and language.
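The area-based split of VG Captions described above can be sketched as a simple routing step. The threshold of $16,384$ pixels (equal to $128\times 128$) comes from the text; the function names and the tuple format of a region are assumptions, and since the text does not say how a region of exactly $16,384$ pixels is treated, this sketch leaves such regions out of both sets.

```python
AREA_THRESHOLD = 16_384  # pixels, from the text (128 * 128)


def region_area(box):
    """Area in pixels of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def route_vg_regions(regions):
    """Split (box, caption) regions by area.

    Larger regions become region-caption pairs for captioning / matching;
    smaller regions are used for visual grounding and grounded captioning.
    """
    caption_pairs, grounding_pairs = [], []
    for box, caption in regions:
        area = region_area(box)
        if area > AREA_THRESHOLD:
            caption_pairs.append((box, caption))
        elif area < AREA_THRESHOLD:
            grounding_pairs.append((box, caption))
        # area == AREA_THRESHOLD: unspecified in the text, skipped here
    return caption_pairs, grounding_pairs


regions = [
    ((0, 0, 200, 200), "a large field"),   # 40,000 px
    ((10, 10, 60, 60), "a small bird"),    # 2,500 px
    ((0, 0, 128, 128), "a boundary case"), # exactly 16,384 px
]
caption_pairs, grounding_pairs = route_vg_regions(regions)
```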

Uni-modal Data

Uni-modal data includes vision and language data. Vision data consists of raw images for image infilling and object-labeled images for object detection. For image infilling, we collect raw images from OpenImages, YFCC100M and ImageNet-21K, and exclude annotations. Thus the model is unable to access labels in the pretraining stage. For object detection, we collect OpenImages, Object365, VG and COCO. Language data consists of plain texts, i.e., passages consisting of sentences. We use around 140GB of data from Pile to leverage its diversity. Specifically, we extract natural language data and implement preprocessing methods, including truncation to the length of $512$.

A.2 Pretraining Details

For image processing, we first resize and crop the images to different resolutions: $256\times 256$ for $\text{OFA}\rm_{Tiny}$ and $\text{OFA}\rm_{Medium}$ , $384\times 384$ for $\text{OFA}\rm_{Base}$ , $480\times 480$ for $\text{OFA}\rm_{Large}$ and $\text{OFA}\rm_{Huge}$ , with a fixed patch size of $16\times 16$ . Because training $\text{OFA}\rm_{Large}$ and $\text{OFA}\rm_{Huge}$ is time- and computation-consuming, we first train them with images at resolutions of $384\times 384$ and $256\times 256$ , and then continue pretraining with images at a resolution of $480\times 480$ .
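
As a quick check of what these resolutions imply for sequence length, the sketch below computes the number of $16\times 16$ patches per image for each model size. It assumes the patch grid is simply the image side divided by the patch size; the dictionary and function names are illustrative.

```python
PATCH_SIZE = 16

# Final pretraining resolution (image side in pixels) per model size, as listed above.
PRETRAIN_RESOLUTION = {
    "Tiny": 256,
    "Medium": 256,
    "Base": 384,
    "Large": 480,
    "Huge": 480,
}


def num_patches(resolution, patch_size=PATCH_SIZE):
    """Number of non-overlapping square patches in a square image."""
    side = resolution // patch_size
    return side * side


patch_counts = {name: num_patches(res) for name, res in PRETRAIN_RESOLUTION.items()}
```

Under this assumption, moving from $384\times 384$ to $480\times 480$ increases the number of image tokens from 576 to 900, which is one reason the lower resolutions are cheaper to train on.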

For each patch, we obtain its feature vector with the first three blocks of ResNet . The ResNet module is jointly trained along with the transformer module. Note that through extensive experiments we find that randomly sampling patches does not bring additional benefits in our scenario. For text processing, we tokenize the texts with the same BPE Tokenizer as BART . The maximum text sequence length of both encoder and decoder is set to $256$ . We share parameters between the embedding and the decoder softmax output layer.

From our preliminary experiments, we find that the initialization of the Transformer plays an important role. For $\text{OFA}\rm_{Base}$ and $\text{OFA}\rm_{Large}$ , we initialize the transformer with most of the weights of $\text{BART}\rm_{Base}$ and $\text{BART}\rm_{Large}$ , considering the slight difference between the OFA Transformer and BART described in Sec 3.1. For OFA of the other sizes, we pretrain language models with the same pretraining strategy as BART and use the pretrained weights to initialize the Transformer in OFA.

We use the AdamW optimizer with $(\beta_{1},\beta_{2})=(0.9,0.999)$ and $\epsilon=1e\text{-}8$ to pretrain our models. We set the peak learning rate to $2e\text{-}4$ , and apply a linear decay scheduler with a warmup ratio of $0.01$ to control the learning rate. For regularization, we set dropout to $0.1$ and use a weight decay of $0.01$ . We employ stochastic depth with a rate of $0.1$ (applied to the encoder and decoder except for the convolution blocks). We mix all the pretraining data within each batch, which contains $2,048$ vision&language samples, $256$ object detection samples, $256$ image-only samples and $512$ text-only samples. All models are pretrained for at least $300K$ steps except the models used for the ablation study.
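
The learning rate schedule and batch composition can be sketched as follows. This is an illustration under stated assumptions rather than the training code: warmup is assumed to rise linearly from zero over the first 1% of steps, and the decay is assumed to reach zero at the final step; the text does not specify these endpoints. The function name and `total_steps` argument are hypothetical.

```python
PEAK_LR = 2e-4
WARMUP_RATIO = 0.01


def learning_rate(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then linear decay to zero (assumed endpoints)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, total_steps - step) / (total_steps - warmup_steps)


# Per-batch sample counts for each data type, as listed above.
BATCH_COMPOSITION = {
    "vision_language": 2048,
    "object_detection": 256,
    "image_only": 256,
    "text_only": 512,
}
samples_per_batch = sum(BATCH_COMPOSITION.values())
```

With $300K$ steps, this sketch warms up over the first 3,000 steps, and each mixed batch holds 3,072 samples in total.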

A.3 Details of Downstream Tasks

We verify the capability of OFA on various downstream tasks in both finetuning and zero-shot settings. We design various task-specific instructions to transfer the knowledge learned from pretraining to downstream tasks effectively. The instructions of different tasks are listed in Table 12. For finetuning, if not specified, the input image resolution is set to $480\times 480$ , and the other hyper-parameters remain the same as for pretraining. The experimental details of different downstream tasks, including both multimodal and uni-modal tasks, are listed below:

Image Captioning

Image captioning is a standard vision&language task that requires models to generate an appropriate and fluent caption for an image. We adopt the most widely used MSCOCO Image Caption dataset to evaluate the multi-modal generation capability of OFA. We report BLEU-4 , METEOR , CIDEr , and SPICE scores on the Karpathy test split . Following standard practice, we first finetune OFA with cross-entropy loss for $2$ epochs with a batch size of $128$ , a learning rate of $1e-5$ , and label smoothing of $0.1$ . We then finetune the model with CIDEr optimization for $3$ epochs with a batch size of $64$ , and disable dropout and stochastic depth. We report scores at both stages.

Visual Question Answering

Visual question answering (VQA) is a cross-modal task that requires the models to answer the question given an image. Previous works such as VLMo or SimVLM define VQA as a classification task. They use a linear output layer to predict the probability of each candidate answer on a given set. In contrast with these studies, to adapt the generative OFA model to the VQA benchmark, we use the Trie-based search strategy mentioned in Sec. 3.4 to ensure that the answer generated by OFA is constrained to the candidate set. We evaluate our model against other baselines on the commonly used VQAv2 dataset . Accuracy scores on both test-dev and test-std sets are reported. The OFA models of all the reported sizes are finetuned for $40,000$ steps with a batch size of $512$ . The learning rate is $5e-5$ with label smoothing of $0.1$ . When finetuning $\text{OFA}\rm_{Large}$ and $\text{OFA}\rm_{Huge}$ , we increase the image resolution from $480$ to $640$ . Linear interpolation of the image absolute positional embedding proposed in is employed when transferring the pretrained OFA to VQA finetuning. During Trie-based searching, we constrain the generated answers to the most frequent $3,129$ answer candidates. Exponential moving average (EMA) with decay rate $0.9999$ is employed in finetuning.
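
The EMA of model weights can be sketched as follows, assuming the common formulation in which shadow weights are updated after every optimizer step. Parameters are plain floats here for illustration, and the class name `ParameterEMA` is hypothetical, not taken from the OFA code.

```python
class ParameterEMA:
    """Keeps an exponential moving average of named scalar parameters."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        # The shadow copy starts from the current parameter values.
        self.shadow = dict(params)

    def update(self, params):
        """Blend the latest parameter values into the shadow copy."""
        for name, value in params.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * value
            )


ema = ParameterEMA({"w": 1.0}, decay=0.9999)
ema.update({"w": 2.0})  # one simulated optimizer step moved w from 1.0 to 2.0
```

With a decay of $0.9999$, each step moves the averaged weights only $0.01\%$ of the way toward the current weights, so the average reflects roughly the last ten thousand updates.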

Visual Entailment

Visual entailment requires the model to evaluate how the given image and text are semantically correlated, i.e., entailment, neutral, or contradiction. We perform experiments on the SNLI-VE dataset . The image premise, text premise and text hypothesis are fed to the encoder, and the decoder generates appropriate labels. To transfer the knowledge learned by pretraining to this task, we convert the labels entailment/neutral/contradiction to yes/maybe/no. We also use the Trie-based search strategy to constrain the generated labels over the candidate set. We report accuracy on both dev and test sets. The OFA model is finetuned for $6$ epochs with a learning rate of $2e-5$ and a batch size of $256$ .

Referring Expression Comprehension

Referring expression comprehension requires models to locate an image region described by a language query. Unlike most previous methods, which rank a set of candidate bounding boxes detected by a pretrained object detector, our method directly predicts the best matching bounding box without any proposals. We perform experiments on RefCOCO , RefCOCO+ , and RefCOCOg . Consistent with other downstream tasks, we formulate referring expression comprehension as a conditional sequence generation task. In detail, given an image and a language query, OFA generates the box sequence (e.g., $\langle x_{1},y_{1},x_{2},y_{2}\rangle$ ) in an autoregressive manner. We report the standard metric Acc@0.5 on the validation and test sets. For finetuning, the input image resolution is set to $512\times 512$ . We finetune the OFA model on each dataset for about $10$ epochs with a batch size of $128$ . The learning rate is $3e-5$ with label smoothing of $0.1$ . Each query corresponds to a single image region, so we limit the maximum generated length to $4$ during inference.
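
For reference, Acc@0.5 can be computed as sketched below: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least $0.5$. Boxes are assumed to be $(x_{1},y_{1},x_{2},y_{2})$ tuples in the same coordinate system; whether the threshold is inclusive varies between evaluation scripts, and the inclusive form is an assumption here.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def acc_at_05(predictions, targets):
    """Fraction of predictions whose IoU with the target is at least 0.5."""
    hits = sum(box_iou(p, t) >= 0.5 for p, t in zip(predictions, targets))
    return hits / len(targets)


# Illustrative boxes: the first prediction is exact, the second barely overlaps.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 5, 15, 15)]
accuracy = acc_at_05(predictions, targets)
```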

Image Generation

Following the same setting as , we train our model on the MS COCO train split and evaluate our model on the validation split by randomly sampling $30,000$ images. We use Fréchet Inception Distance (FID) and Inception Score (IS) to evaluate the quality of the images. Following the previous studies , we also compute CLIP Similarity Score (CLIPSIM) to evaluate the semantic similarity between the query text and the generated images. During finetuning, OFA learns to generate the image code sequence according to the given text query only. The model is first finetuned with cross-entropy and then with CLIPSIM optimization following . In the first stage, we finetune the OFA model for about $50$ epochs with a batch size of $512$ and a learning rate of $1e-3$ . In the second stage, the model is finetuned for an extra $5000$ steps with a batch size of $32$ and a learning rate of $1e-6$ . During evaluation, we sample $24$ images at a resolution of $256\times 256$ for each query and choose the best one using the pretrained CLIP model .

For the case study, we compare OFA with CogView and GLIDE. CogView provides an API website (https://wudao.aminer.cn/CogView/index.html). Note that this API samples 8 images at a resolution of $512\times 512$ for each query. We select the first of the generated images and resize it to a resolution of $256\times 256$ . GLIDE provides a Colab notebook (https://colab.research.google.com/drive/1q6tJ58UKod1eCOkbaUNGzF3K5BbXlB5m). Note that the only publicly available GLIDE model is of base size ( $\sim$ 385M).

Image Classification

We provide finetuning results on ImageNet-1K following recent studies in self-supervised learning for computer vision. During finetuning and inference, a Trie-based search strategy is employed to constrain the generated text to the set of $1,000$ candidate labels. We finetune OFA for $32$ epochs with a batch size of $256$ . The learning rate is $5e-5$ . The ratio for label smoothing is $0.1$ . The encouraging loss proposed in is employed with the hyperparameter LE set to $0.75$ . Following , we use the same random resize cropping, random flipping, RandAug and random erasing transformations as data augmentation strategies. Mixup and CutMix are applied to each batch with an overall probability of $0.5$ , with alpha set to $0.8$ and $1.0$ , respectively. To adapt the mixed soft targets of Mixup and CutMix to the generation paradigm during finetuning, we run the decoder twice, each time with one of the two target sequences to be mixed, and sum the losses weighted by the mixing ratio.
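
The two-pass loss for mixed targets can be sketched as follows. Here `sequence_nll` is a hypothetical stand-in for a decoder forward pass: instead of running a model, it takes the probabilities the decoder is assumed to assign to each token of a target sequence and returns the negative log-likelihood. The mixing ratio `lam` plays the role of the Mixup/CutMix coefficient.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence given per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def mixed_target_loss(probs_target_a, probs_target_b, lam):
    """Weight the losses of the two decoder passes by the mixing ratio."""
    loss_a = sequence_nll(probs_target_a)  # pass 1: label sequence of image A
    loss_b = sequence_nll(probs_target_b)  # pass 2: label sequence of image B
    return lam * loss_a + (1.0 - lam) * loss_b


# Toy probabilities (made up): label A spans two tokens, label B a single token.
loss = mixed_target_loss([0.5, 0.5], [0.25], lam=0.75)
```

Because each label is a token sequence rather than a single class index, a soft target cannot be expressed as one mixed one-hot vector, which is why the sketch combines two sequence-level losses instead.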

Natural Language Understanding

To verify the natural language understanding ability of OFA, we select $6$ language understanding tasks from the GLUE benchmark , including both single-sentence classification tasks and sentence-pair classification tasks. To adapt to sentence-pair classification, previous models usually use segment embeddings to distinguish different sentences. Unlike those models, OFA can be applied to sentence-pair classification tasks by constructing appropriate instructions, without introducing additional segment embeddings. For the hyper-parameters of finetuning, we tune the training epochs among $\{5,7,10\}$ , learning rate among $\{3e-5,5e-5,6e-5,7e-5,1e-4\}$ , batch size among $\{32,64,128\}$ , weight decay among $\{0.01,0.05\}$ , and dropout rate among $\{0.0,0.1\}$ . We report the best performance on the development set for each task.

Natural Language Generation

We verify the natural language generation ability of OFA on the Gigaword dataset . We report ROUGE-1/ROUGE-2/ROUGE-L to evaluate the generation results following . We finetune the OFA models for $6$ epochs with a batch size of $512$ . The learning rate is $1e-4$ with label smoothing of $0.1$ , and the maximum input text sequence length is set to $512$ . During inference, we set the length penalty to $0.7$ and beam size to $6$ , and limit the maximum generated length to $32$ .

Appendix B Trie-based Search

This section describes how to use Trie-based search to improve model performance on downstream classification tasks. When dealing with classification tasks, we first construct a Trie where nodes are annotated with tokens from the candidate label-set. During finetuning, the model computes the log-probabilities of the target tokens based on their positions on the Trie. As shown in Figure 6, when computing the log-probabilities of the target token “sky”, we only consider tokens in {“sky”, “ocean”} and forcefully set the logits for all invalid tokens to $-\infty$ . During inference, we constrain the generated labels to the candidate set. As shown in Table 13, the Trie-based search strategy boosts the performance of OFA in various downstream classification tasks.
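
The masking step can be sketched as follows. This is a minimal illustration, not the OFA implementation: the Trie is a nested dictionary, tokens are plain strings, logits are a dictionary from token to score, and the three candidate labels are made up for the example. End-of-sequence handling is omitted.

```python
import math


def build_trie(label_token_sequences):
    """Build a nested-dict Trie from tokenized candidate labels."""
    root = {}
    for tokens in label_token_sequences:
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
    return root


def allowed_next_tokens(trie, prefix):
    """Tokens that can follow the given prefix according to the Trie."""
    node = trie
    for token in prefix:
        node = node[token]
    return set(node)


def mask_logits(logits, allowed):
    """Set the logits of all tokens outside the allowed set to -inf."""
    return {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }


# Hypothetical candidate labels, each already split into tokens.
labels = [["blue", "sky"], ["blue", "ocean"], ["green", "grass"]]
trie = build_trie(labels)
allowed = allowed_next_tokens(trie, ["blue"])
masked = mask_logits({"sky": 1.2, "ocean": 0.3, "grass": 2.5}, allowed)
```

In this toy example, after the prefix "blue" only "sky" and "ocean" remain valid, so the otherwise highest-scoring token "grass" is masked out before the softmax.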

Appendix C Qualitative Examples

This section provides more qualitative examples generated by OFA on multiple tasks, including text-to-image generation, open-domain VQA, grounded question answering, and open-domain visual grounding. We hope that these examples give readers a better understanding of OFA.