GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

Introduction

Tremendous advances have been made in recent years on vision-language (VL) pre-training, especially based on the large-scale data of image-text pairs, e.g., CLIP (Radford et al., 2021), Florence (Yuan et al., 2021), and SimVLM (Wang et al., 2021b). The learned representation greatly boosts the performance on various downstream tasks, such as image captioning (Lin et al., 2014), visual question answering (VQA) (Goyal et al., 2017), and image-text retrieval.

During pre-training, Masked Language Modeling (MLM) and Image-Text Matching (ITM) tasks have been widely used (Wang et al., 2020; Fang et al., 2021c; Li et al., 2020b; Zhang et al., 2021a; Chen et al., 2020b; Dou et al., 2021; Wang et al., 2021a; Kim et al., 2021). However, these losses differ from the downstream tasks, so task-specific adaptation has to be made. For example, ITM is removed for image captioning (Wang et al., 2021a; Li et al., 2020b), and an extra randomly initialized multi-layer perceptron is added for VQA (Wang et al., 2021b; Li et al., 2020b). To reduce this discrepancy, recent approaches (Cho et al., 2021; Wang et al., 2021b; Yang et al., 2021b; Wang et al., 2022b) have attempted to design unified generative models for pre-training, as most VL tasks can be cast as generation problems. These approaches typically leverage a multi-modal encoder and a text decoder with careful design of the text input and the text target. To further push the frontier of this direction, we present a simple Generative Image-to-text Transformer, named GIT, which consists of only one image encoder and one text decoder. The pre-training task is just to map the input image to the entire associated text description with the language modeling objective. Despite its simplicity, GIT achieves new state-of-the-art results across numerous challenging benchmarks by a large margin, as summarized in Table 1.

The image encoder is a Swin-like vision transformer (Dosovitskiy et al., 2021; Yuan et al., 2021) pre-trained on massive image-text pairs based on the contrastive task (Jia et al., 2021; Radford et al., 2021; Yuan et al., 2021). This eliminates the dependency on the object detector, which is used in many existing approaches (Anderson et al., 2018; Li et al., 2020b; Wang et al., 2020; Zhang et al., 2021a; Chen et al., 2020b; Fang et al., 2021c). To extend it to the video domain, we simply extract the features of multiple sampled frames and concatenate them as the video representation. The text decoder is a transformer network to predict the associated text. The entire network is trained with the language modeling task. For VQA, the input question is treated as a text prefix, and the answer is generated in an auto-regressive way. Furthermore, we present a new generation-based scheme for ImageNet classification, where the predicted labels come directly from our generative model without pre-defining the vocabulary.

The approach is simple, but the performance is surprisingly impressive after we scale up the pre-training data and the model size. Fig. 1 shows captions generated by GIT fine-tuned on TextCaps. The samples demonstrate the model’s strong capability of recognizing and describing scene text, tables, charts, food, banknotes, logos, landmarks, characters, celebrities, products, etc., indicating that our GIT model has encoded rich multi-modal knowledge about the visual world.

We present GIT, which consists of only one image encoder and one text decoder, pre-trained on 0.8 billion image-text pairs with the language modeling task.

We demonstrate new state-of-the-art performance over numerous tasks on image/video captioning and QA (Table 1), without the dependency on object detectors, object tags, and OCR. On TextCaps, we surpass the human performance for the first time. This implies that a simple network architecture can also achieve strong performance with scaling.

We demonstrate that GIT pre-trained on the image-text pairs is capable of achieving new state-of-the-art performance even on video tasks without video-dedicated encoders.

We present a new scheme of generation-based image classification. On ImageNet-1K, we show a decent performance (88.79% top-1 accuracy) with our GIT.

Related Work

In VL pre-training, multi-task pre-training has been widely used to empower the network with multiple or enhanced capabilities. For example, MLM and ITM are widely adopted pre-training tasks (Li et al., 2020b; Kim et al., 2021; Zhang et al., 2021a; Wang et al., 2020; Xue et al., 2021b; Lu et al., 2019; Tan & Bansal, 2019). Recently, the image-text contrastive loss has also been added in Yu et al. (2022); Li et al. (2021a); Wang et al. (2021a). Since most VL tasks can be formulated as the text generation task (Cho et al., 2021), a single generation model can be pre-trained to support various downstream tasks. The input and output texts are usually carefully designed to pre-train such a generation model. For example in Cho et al. (2021), the text is properly masked as the network input and the goal is to recover the masked text span. SimVLM (Wang et al., 2021b) randomly splits a text sentence into the input and the target output. In these methods, a multi-modal transformer encoder is utilized to incorporate the text inputs before decoding the output.

For image representation, Faster RCNN has been used in most existing approaches (Anderson et al., 2018; Li et al., 2020b; Wang et al., 2020; Zhang et al., 2021a; Chen et al., 2020b; Fang et al., 2021c) to extract the region features. Recently, there is growing interest in dense representations (Huang et al., 2020; Wang et al., 2021b; a; Kim et al., 2021; Fang et al., 2021b; Dou et al., 2021; Li et al., 2021a) from the feature map, which require no bounding box annotations. Meanwhile, it is easy to train the entire network in an end-to-end way. In addition to the representation from the feature map, object tags (Li et al., 2020b; Wang et al., 2020; Zhang et al., 2021a; Cornia et al., 2021; Fang et al., 2021b) are leveraged to help the transformer understand the context, especially novel objects. For scene-text-related tasks, OCR is invoked to generate the scene text as additional network input, e.g., in Hu et al. (2020); Yang et al. (2021c). For the text prediction, a transformer network is typically used, which can incorporate the cross-attention module to fuse the image tokens, e.g., Cho et al. (2021); Alayrac et al. (2022); Yang et al. (2021b); Yu et al. (2022), or only the self-attention modules where the image tokens are concatenated with the text tokens, e.g., Li et al. (2020b); Chen et al. (2020b); Zhang et al. (2021a); Wang et al. (2020); Fang et al. (2021b).

Along the direction of scaling on VL tasks, LEMON (Hu et al., 2021a) studies the behavior of the detector-based captioning model with MLM. CoCa (Yu et al., 2022) studies different model sizes, but on the same pre-training data. In this paper, we present a comprehensive study on 9 benchmarks (3 in the main paper and 6 in supplementary materials, covering image/video captioning and QA tasks) with 3 different model sizes and 3 different pre-training data scales (9 data points for each benchmark).

Generative Image-to-text Transformer

With large-scale image-text pairs, our goal is to pre-train a VL model which is simple yet effective to benefit image/video captioning and QA tasks. As the input is the image and the output is the text, the minimal set of components could be one image encoder and one text decoder, which are the only components of our GIT as illustrated in Fig. 2.

The image encoder is based on the contrastive pre-trained model (Yuan et al., 2021). The input is the raw image and the output is a compact 2D feature map, which is flattened into a list of features. With an extra linear layer and a layernorm layer, the image features are projected into $D$ dimensions, which are the input to the text decoder. We use the image encoder pre-trained with contrastive tasks because recent studies show superior performance with such image encoders, e.g. Yuan et al. (2021); Dou et al. (2021); Alayrac et al. (2022). In Sec 4.6 and supplementary materials, we also observe that the VL performance improves significantly with a stronger image encoder. This is consistent with the observation in object detection-based approaches, e.g. in Wang et al. (2020); Zhang et al. (2021a). The concurrent work of CoCa (Yu et al., 2022) unifies the contrastive task and the generation task as one pre-training phase. Our approach is equivalent to separating the two tasks sequentially: ( $i$ ) using the contrastive task to pre-train the image encoder followed by ( $ii$ ) using the generation task to pre-train both the image encoder and text decoder.

The text decoder is a transformer module to predict the text description. The transformer module consists of multiple transformer blocks, each of which is composed of one self-attention layer and one feed-forward layer. The text is tokenized and embedded into $D$ dimensions, followed by an addition of the positional encoding and a layernorm layer. The image features are concatenated with the text embeddings as the input to the transformer module. The text begins with the [BOS] token, and is decoded in an auto-regressive way until the [EOS] token or reaching the maximum steps. The seq2seq attention mask as in Fig. 3 is applied such that the text token only depends on the preceding tokens and all image tokens, and image tokens can attend to each other. This is different from a unidirectional attention mask, where not every image token can rely on all other image tokens.
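As a minimal sketch of the seq2seq attention mask described above, the following Python builds a boolean matrix in which entry (q, k) is True when query position q may attend to key position k. The function name, the image-first token ordering, and the choice to let each text token attend to itself are assumptions made for illustration; they are not details confirmed by the paper.

```python
def build_seq2seq_mask(num_image_tokens, num_text_tokens):
    """Return mask[q][k] == True if position q may attend to position k.

    Assumed layout: image tokens first, then text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image_tokens:
                # Every token (image or text) sees all image tokens.
                mask[q][k] = True
            elif q >= num_image_tokens and k <= q:
                # A text token sees itself and the preceding text tokens.
                mask[q][k] = True
            # Otherwise: image tokens never see text tokens,
            # and text tokens never see future text tokens.
    return mask


if __name__ == "__main__":
    for row in build_seq2seq_mask(2, 3):
        print("".join("1" if allowed else "0" for allowed in row))
```

A purely unidirectional mask would instead be lower-triangular over all positions, so the first image token could not see the second one; the block of ones over the image columns is what distinguishes the seq2seq mask.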

Whereas the image encoder is well initialized, we randomly initialize the text decoder. This design choice is motivated by the experimental studies of Wang et al. (2020), in which random initialization shows similar performance to BERT initialization. This could be because the BERT initialization cannot understand the image signal, which is critical for VL tasks. Without a dependency on the initialization, we can easily explore different design choices. The concurrent work of Flamingo (Alayrac et al., 2022) employs a similar architecture of image encoder + text decoder, but their decoder is pre-trained and frozen to preserve the generalization capability of the large language model. In our GIT, all parameters are updated to better fit the VL tasks.

An alternative architecture is the cross-attention-based decoder to incorporate the image signals instead of concatenation with self-attention. Empirically as shown in supplementary material (Appendix G.2), with large-scale pre-training, we find the self-attention-based decoder achieves better performance overall, while in small-scale setting, the cross-attention-based approach wins. A plausible explanation is that with sufficient training, the decoder parameters can well process both the image and the text, and the image tokens can be better updated with the self-attention for text generation. With cross-attention, the image tokens cannot attend to each other.

2 Pre-training

For each image-text pair, let $I$ be the image, $y_{i},i\in\{1,\cdots,N\}$ be the text tokens, $y_{0}$ be the [BOS] token and $y_{N+1}$ be the [EOS] token. We apply the language modeling (LM) loss to train the model. That is,

$$l=\frac{1}{N+1}\sum_{i=1}^{N+1}\text{CE}\big(y_{i},\,p(y_{i}\mid I,\{y_{j},\,j=0,\cdots,i-1\})\big),$$

where CE is the cross-entropy loss with label smoothing of 0.1.
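To make the LM objective concrete, the sketch below computes a label-smoothed cross-entropy for each predicted token and averages it over the $N+1$ targets ($y_{1},\cdots,y_{N}$ and [EOS]). The function names are hypothetical, and the smoothing convention (mixing the one-hot target with a uniform distribution over all classes) is an assumption; the paper only states the smoothing value of 0.1.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy between softmax(logits) and a smoothed one-hot target."""
    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]

    num_classes = len(logits)
    loss = 0.0
    for index, log_prob in enumerate(log_probs):
        # Assumed convention: (1 - eps) on the target plus eps / K everywhere.
        weight = smoothing / num_classes
        if index == target:
            weight += 1.0 - smoothing
        loss -= weight * log_prob
    return loss


def lm_loss(step_logits, targets, smoothing=0.1):
    """Average the per-token loss over all N + 1 target positions."""
    assert len(step_logits) == len(targets)
    losses = [
        smoothed_cross_entropy(logits, target, smoothing)
        for logits, target in zip(step_logits, targets)
    ]
    return sum(losses) / len(losses)
```

In practice the logits at step $i$ would come from the decoder conditioned on the image tokens and $y_{0},\cdots,y_{i-1}$; here they are plain lists so the arithmetic can be checked by hand.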

An alternative choice is MLM, which typically predicts 15% of the input tokens in each iteration. To predict all tokens, we have to run at least $1/0.15\approx 6.7$ epochs. For LM, each iteration can predict all tokens, which is more efficient for large-scale pre-training data. In Hu et al. (2021a), the ablation studies also show that LM can achieve better performance with limited epochs. In our large-scale training, the number of epochs is only 2 due to computational resource limitations, and thus we choose LM. Meanwhile, most of the recent large-scale language models are also based on LM, e.g. Brown et al. (2020); Chowdhery et al. (2022).

Without the image input, the model is reduced to a decoder-only language model, architecturally similar to GPT3 (Brown et al., 2020). Thus, this design also makes it possible to leverage text-only data to enrich the decoding capability with a scaled-up decoder. We leave this as future work.

3 Fine-tuning

For the image captioning task, as the training data format is the same as that in pre-training, we apply the same LM task to fine-tune our GIT.

For visual question answering, the question and the ground-truth answer are concatenated as a new special caption during the fine-tuning, but the LM loss is only applied on the answer and the [EOS] tokens. During inference, the question is interpreted as the caption prefix and the completed part is the prediction. Compared with the existing approaches (Wang et al., 2021a; b; Zhang et al., 2021a; Li et al., 2022b) for VQAv2 (Goyal et al., 2017), our model is generative without pre-defining the candidate answers, even in inference. This imposes more challenges as the model has to predict at least two correct tokens: one for the answer and another for [EOS]. In contrast, the existing work pre-collects the answer candidates, recasts the problem as a classification problem, and only needs to predict once. However, considering the benefit of the free-form answer, we choose the generative approach. Due to the difficulty of the generative model, we observe slightly worse performance on VQAv2 than the existing discriminative work. For the scene-text related VQA tasks, existing approaches (Yang et al., 2021c; Hu et al., 2020) typically leverage an OCR engine to generate the scene text and use a dynamic pointer network to decide whether the current output token should be an OCR token or general text. Here, our approach depends on no OCR engine, and thus no dynamic pointer network. Empirically, we find the model gradually learns how to read the scene text with large-scale pre-training, and our model achieves new SoTA performance on these tasks.
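The VQA fine-tuning format can be sketched as follows: the question and answer are joined into one caption, and a loss mask selects only the answer and [EOS] targets. The function name and the representation of tokens as strings are assumptions for illustration only.

```python
def build_vqa_example(question_tokens, answer_tokens, bos="[BOS]", eos="[EOS]"):
    """Return (inputs, targets, loss_mask) for teacher-forced LM training.

    loss_mask[i] == 1 means the LM loss is applied to targets[i].
    """
    tokens = [bos] + question_tokens + answer_tokens + [eos]
    inputs = tokens[:-1]   # what the decoder reads
    targets = tokens[1:]   # what the decoder must predict at each step

    # The question acts as a prefix: its tokens are predicted but not penalized.
    loss_mask = [0] * len(question_tokens) + [1] * (len(answer_tokens) + 1)
    return inputs, targets, loss_mask


if __name__ == "__main__":
    example = build_vqa_example(["what", "color", "?"], ["red"])
    for name, values in zip(("inputs", "targets", "loss_mask"), example):
        print(name, values)
```

With a one-token answer, exactly two targets carry loss (the answer and [EOS]), which matches the observation above that a correct answer needs at least two correct predictions.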

Our model is not specifically designed for the video domain, but we find our model can also achieve competitive or even new SOTA performance with a simple architecture change. That is, we sample multiple frames from each video clip, and encode each frame via the image encoder independently. Afterwards, we add a learnable temporal embedding (initialized as zeros), and concatenate the features from sampled frames. The final representation is used in a similar way as the image representation for captioning and question answering.
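A rough sketch of the video representation described above: each sampled frame is encoded independently, a temporal embedding is added to every token of that frame, and the frames are concatenated along the token axis. The names are hypothetical, and using one embedding vector per frame index is an assumption; features are plain nested lists so the example runs without any deep learning library.

```python
def build_video_tokens(frame_features, temporal_embeddings):
    """Concatenate per-frame tokens after adding a per-frame temporal embedding.

    frame_features: list over frames of [num_tokens][D] features.
    temporal_embeddings: list over frames of [D] vectors (learnable in practice).
    """
    assert len(frame_features) == len(temporal_embeddings)
    video_tokens = []
    for frame, frame_embedding in zip(frame_features, temporal_embeddings):
        for token in frame:
            video_tokens.append(
                [value + offset for value, offset in zip(token, frame_embedding)]
            )
    return video_tokens


if __name__ == "__main__":
    frames = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    zero_init = [[0.0, 0.0] for _ in frames]  # embeddings start at zero
    print(build_video_tokens(frames, zero_init))
```

Because the embeddings are initialized as zeros, the video representation at the start of fine-tuning is just the concatenation of the image features, so the pre-trained image pathway is left undisturbed.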

We also apply our generation model to the image classification task, where the class names are interpreted as image captions, and our GIT is fine-tuned to predict the result in an auto-regressive way. This is different from existing work which normally pre-defines the vocabulary and uses a linear layer (with softmax) to predict the likelihood of each category. This new generation-based scheme is beneficial when new data and new categories are added to the existing dataset. In this case, the network can continuously train on the new data without introducing new parameters.

Experiments

We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a). The image encoder is initialized from the pre-trained contrastive model (Yuan et al., 2021). The hidden dimension ( $D$ ) is 768. The text decoder consists of 6 randomly-initialized transformer blocks. The total number of model parameters is 0.7 billion. The learning rates of the image encoder and the decoder are $1\times 10^{-5}$ and $5\times 10^{-5}$ , respectively, and follow the cosine decay to 0. The total number of epochs is $2$ . During inference, the beam size is $4$ and the length penalty (Wu et al., 2016) is 0.6 by default.
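For reference, the length penalty of Wu et al. (2016) divides the summed log-probability of a hypothesis $Y$ by $lp(Y)=(5+|Y|)^{\alpha}/(5+1)^{\alpha}$. The sketch below applies it with $\alpha=0.6$ to rank finished beam hypotheses. Whether the GIT implementation uses exactly this form is an assumption, and the function names are hypothetical.

```python
def length_penalty(length, alpha=0.6):
    """Length penalty lp(Y) from Wu et al. (2016); equals 1 when length == 1."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(sum_log_prob, length, alpha=0.6):
    """Length-normalized hypothesis score used to rank finished beams."""
    return sum_log_prob / length_penalty(length, alpha)


def pick_best(hypotheses, alpha=0.6):
    """hypotheses: list of (tokens, sum_log_prob) pairs from beam search."""
    return max(
        hypotheses,
        key=lambda hyp: normalized_score(hyp[1], len(hyp[0]), alpha),
    )


if __name__ == "__main__":
    beams = [(["a"], -1.0), (["a", "b", "c", "d", "e", "f", "g"], -1.2)]
    print(pick_best(beams)[0])             # the penalty favors the longer caption
    print(pick_best(beams, alpha=0.0)[0])  # without it, the shorter one wins
```

Since log-probabilities are negative and accumulate with length, dividing by a factor that grows with length offsets the bias of beam search toward short outputs.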

Supplementary materials show results on two smaller model variants (GIT$_{\text{B}}$ and GIT$_{\text{L}}$) and one even larger model (GIT2) with full details. When comparing with existing approaches, the reference numbers are the best ones reported in the corresponding papers unless explicitly specified.

2 Results on Image Captioning and Question Answering

We comprehensively evaluate the captioning performance on the widely-used Karpathy split (Karpathy & Li, 2015) of COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014), the COCO test set, nocaps (Agrawal et al., 2019), which focuses on novel objects (we compare all approaches, including those using external image-text datasets), TextCaps (Sidorov et al., 2020), which focuses on scene-text understanding, and VizWiz-Captions (Gurari et al., 2020), which focuses on the real use case by vision-impaired people. The results in CIDEr (Vedantam et al., 2015) are shown in Tables 2 and 3. From the results, we can see our model achieves new SOTA performance on all these benchmarks except the COCO Karpathy test. On nocaps, compared with CoCa (Yu et al., 2022), our model is much smaller in model size (0.7B vs 2.1B), but achieves higher performance (123.0 vs 120.6 in CIDEr). On TextCaps, our solution outperforms the previous SOTA (TAP, Yang et al. (2021c)) by a breakthrough margin (28.5 points in CIDEr), and also surpasses the human performance for the first time. For zero/few-shot evaluation as shown in Table 3, our model can significantly benefit from more shots. With 32 shots, our approach is also better than Flamingo.

On VQA, the evaluation benchmarks include VQAv2 (Goyal et al., 2017), TextVQA (Singh et al., 2019), VizWiz-VQA (Gurari et al., 2018), ST-VQA (Biten et al., 2019), and OCR-VQA (Mishra et al., 2019). Before fine-tuning the model, we run an intermediate fine-tuning on the combination of the training data of VQAv2, TextVQA, ST-VQA, OCR-VQA, VizWiz-VQA, Visual Genome QA (Krishna et al., 2016), GQA (Hudson & Manning, 2019), and OK-VQA (Marino et al., 2019). To avoid data contamination, we remove images that duplicate those in the test and validation sets of the target benchmarks. As illustrated in Table 4, we achieve new SOTA on VizWiz-VQA and OCR-VQA, and the same performance as the prior SOTA of LaTr (Biten et al., 2022) on ST-VQA. Compared with the concurrent work of Flamingo (Alayrac et al., 2022), we achieve higher accuracy (+5.4) on TextVQA and lower (-3.29) on VQAv2. Note that Flamingo’s model size is 80B, which is 114 times ours (0.7B). On VQAv2, we observe that our model performs worse by 1.5 points than the discriminative model of Florence (Yuan et al., 2021), which shares the same image encoder. The reason might be the increased difficulty of the generative model. That is, each correct answer requires at least two correct predictions (answer and [EOS]; 2.2 on average), while the discriminative model requires only one correct prediction. In Wang et al. (2021b), the ablation study also shows that the discriminative counterpart performs better by around 1 point. Another reason could be that the model of Florence for VQA leverages RoBERTa (Liu et al., 2019) as the text encoder, which implicitly uses text-only data to improve the performance.

3 Results on Video Captioning and Question Answering

On the video captioning task, the performance is evaluated on MSVD (Chen & Dolan, 2011) with the widely-used splits from Venugopalan et al. (2014), MSRVTT (Xu et al., 2016), YouCook2 (Zhou et al., 2018) (results in supplementary materials), VATEX (Wang et al., 2019b), and TVC (Lei et al., 2020) (results in supplementary materials). On VATEX, the performance is evaluated on both the public test and private test (evaluated on the server). Video QA is evaluated on MSVD-QA (Xu et al., 2017; Chen & Dolan, 2011), MSRVTT-QA (Xu et al., 2017; 2016), and TGIF-Frame (Jang et al., 2017), which are all open-ended tasks. The results are shown in Table 5 and Table 6 for captioning and QA, respectively. Although our model is not dedicated to video tasks, it achieves new SOTA on MSVD, MSRVTT, and VATEX for captioning and on MSVD-QA and TGIF-Frame for QA. For example, on the VATEX private test, our results are even better (93.8 vs 86.5) than CLIP4Caption++ (Tang et al., 2021), which relies on model ensemble and additional subtitle input. This is also better than Flamingo (Alayrac et al., 2022) (84.2) with 80B parameters.

4 Results on Image Classification

We fine-tune GIT on ImageNet-1k. Each category is mapped to a unique class name, and the prediction is correct only if it exactly matches the ground-truth label up to differences in whitespace, i.e., pred.replace(' ', '') == gt.replace(' ', ''). As shown in Table 8, our approach can achieve decent accuracy without pre-defining the vocabulary. Compared with Florence (Yuan et al., 2021) (same image encoder), our approach is worse by about 1.2 points. The reason might be similar to the case on VQAv2. That is, the generative approach needs to predict more tokens correctly to make one correct prediction, which increases the difficulty.

Zero-shot/Few-shot. The result is shown in Table 9. With no knowledge of the vocabulary, the pretrained GIT cannot infer the expected vocabulary, and thus the exact-match accuracy is only 1.93% (in the column of equal). However, if we relax the requirement and count a prediction as correct if it contains the ground-truth, the accuracy is 40.88% (in the column of in), which shows the predicted caption can well identify the image content. If we have the vocabulary as a prior and limit the output tokens to be within the vocabulary, the accuracy drops to 33.48% (in the column of voc-prior). This may suggest that directly predicting the category name is less natural for the network. By fine-tuning the model with only 1 shot or 5 shots per category, we observe that the accuracy is significantly improved. This demonstrates our model can be easily adapted to downstream tasks even with a few training samples. With the shots increased from 1 to 5, the gap between voc-prior and the other two columns (equal and in) becomes smaller. This is expected, as more shots can better guide the network to predict in-vocabulary output.
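The two string-based criteria used above (the whitespace-insensitive exact match, reported as equal, and the relaxed containment check, reported as in) can be sketched as follows. The helper names are hypothetical, and ignoring whitespace in the containment check as well is an assumption; the paper only specifies it for the exact match.

```python
def strip_spaces(text):
    """Remove spaces so that 'tabby cat' and 'tabbycat' compare equal."""
    return text.replace(" ", "")


def equal_match(prediction, ground_truth):
    """'equal' column: exact match up to differences in whitespace."""
    return strip_spaces(prediction) == strip_spaces(ground_truth)


def in_match(prediction, ground_truth):
    """'in' column: the prediction contains the ground-truth class name."""
    return strip_spaces(ground_truth) in strip_spaces(prediction)


def accuracy(predictions, ground_truths, match):
    """Fraction of predictions accepted by the given match function."""
    hits = sum(match(p, g) for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)


if __name__ == "__main__":
    preds = ["tabby cat", "a photo of a dog"]
    labels = ["tabby cat", "dog"]
    print(accuracy(preds, labels, equal_match), accuracy(preds, labels, in_match))
```

A caption-style prediction such as "a photo of a dog" fails the exact match but passes the containment check, which is the kind of gap the equal and in columns expose for the model before fine-tuning.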

Compared with Flamingo, our GIT achieves higher accuracy. Flamingo conducts few-shot learning without parameter updates, but each test image is combined with the support training examples as extra network inputs. Meanwhile, different test images require different support shots based on Yang et al. (2022b). These may increase the inference cost. In contrast, our model updates the parameters once by a lightweight fine-tuning, after which the training shots are not required during inference.

5 Results on Scene Text Recognition

The task (Graves et al., 2006) aims to read scene text directly from the image. We evaluate our model in two settings. One is the GIT fine-tuned on TextCaps. The prediction is considered correct if the caption contains the ground-truth scene text word. The other is to fine-tune the model on two large scene text datasets: MJSynth (MJ) (Jaderberg et al., 2014; 2016) and SynthText (ST) (Gupta et al., 2016), where the ground-truth scene text is used as the caption. The prediction is correct if the output exactly matches the ground-truth. Following the established setup, we evaluate on six standard benchmarks, including ICDAR 2013 (IC13) (Karatzas et al., 2013), ICDAR 2015 (IC15) (Karatzas et al., 2015), IIIT 5K-Words (IIIT) (Mishra et al., 2012), Street View Text (SVT) (Wang et al., 2011), Street View Text-Perspective (SVTP) (Phan et al., 2013), and CUTE80 (CUTE) (Risnumawan et al., 2014). The average accuracy is reported in Table 8. The accuracy on individual test sets is in supplementary materials. Our TextCaps-fine-tuned captioning model achieves an accuracy of 89.9, which demonstrates the strong scene text comprehension capability of our captioning model. After fine-tuning the model on the standard MJ+ST datasets, GIT achieves 92.9, surpassing the prior art (Fang et al., 2021a; He et al., 2022b) of 91.9.

6 Analysis

Model and data scaling. To study the trend with data scales, we construct two smaller pre-training datasets: one is the combination of COCO, SBU, CC3M and VG, leading to 4M images or 10M image-text pairs; the other further adds CC12M, leading to about 14M images or 20M image-text pairs. When pre-training on small-scale datasets, we use 30 epochs rather than 2 epochs as on the 0.8B data. For the network structure, we name our model Huge and replace the image encoder with ViT-B/16 and ViT-L/14 from CLIP (Radford et al., 2021) as Base and Large, respectively. Fig. 4 shows the results on COCO, TextCaps, and VizWiz-QA. On COCO, the Base model benefits from 4M to 14M, but the performance drops with 0.8B data. The 14M data are more similar to COCO than the majority of the noisy 0.8B data. Meanwhile, the Base model with limited capacity may not be able to benefit effectively from large-scale data. Similar observations are also reported in Kolesnikov et al. (2020) for ImageNet-1k classification. On TextCaps and VizWiz-QA, all model variants benefit significantly from more pre-training data. Also, a larger backbone improves more, especially with 0.8B data.

Here, we scale the image encoder. Empirically, we find it is difficult to effectively scale up the text decoder. Preliminary results in Table 10 show that a larger decoder brings no improvement. The reason might be that it is difficult to effectively train with a limited amount of text by LM. Another plausible reason is that the image encoder is responsible for object recognition, and the decoder is responsible for organizing the object terms in a natural language way. The latter task might be easy since most of the descriptions follow similar patterns, e.g. object + verb + subject, and thus a small decoder is enough during end-to-end training. Larger decoders increase the learning difficulty, which might degrade the performance.

Flamingo (Alayrac et al., 2022) shows a larger decoder improves the performance. However, their decoder is pre-trained and frozen during the VL pre-training, which avoids the problem of how to effectively train the decoder. In LEMON (Hu et al., 2021a), the transformer can be scaled up to 32 layers. The reason could be that LEMON uses MLM instead of LM, and LM might be more difficult to train.

Scene text in pre-training data. To understand the capability of scene text comprehension, we examine the pre-training dataset and study how many image-text pairs contain the scene text. We first run the Microsoft Azure OCR API (https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text) against all images in CC12M and 500K of the web-crawled images. The OCR result is compared with the associated text. It is considered matched only if the text contains an OCR result that is longer than 5 characters. It is estimated that 15% of CC12M and 31% of the downloaded images contain scene text descriptions. As the training task is to predict the texts, the network gradually learns to read the scene text.
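The matching rule above can be written as a short predicate: a pair counts as containing a scene text description when some OCR result longer than 5 characters appears in the associated text. The function name is hypothetical, and the case-insensitive comparison is an assumption not stated in the paper; the OCR call itself is outside the sketch and its outputs are passed in as plain strings.

```python
def has_scene_text_description(caption, ocr_results, min_chars=5):
    """True if any OCR string longer than `min_chars` occurs in the caption."""
    caption_lower = caption.lower()  # assumed: matching ignores letter case
    return any(
        len(result) > min_chars and result.lower() in caption_lower
        for result in ocr_results
    )


def scene_text_ratio(pairs):
    """pairs: iterable of (caption, ocr_results); returns the matched fraction."""
    pairs = list(pairs)
    matched = sum(has_scene_text_description(c, ocr) for c, ocr in pairs)
    return matched / len(pairs)


if __name__ == "__main__":
    sample = [
        ("Welcome to Starbucks Coffee", ["STARBUCKS", "to"]),
        ("A cat on a sofa", ["EXIT"]),
    ]
    print(scene_text_ratio(sample))
```

The length threshold filters out short OCR fragments such as "to" that would match many captions by coincidence.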

Conclusion

In the paper, we design and train a simple generative model, named GIT, to map the input image to the associated text description on large-scale image-text pairs. On image/video captioning and question answering tasks, our model achieves new state-of-the-art performance across numerous benchmarks and surpasses the human performance on TextCaps for the first time. For the image classification, we apply the generation task to predict the label name directly. The strategy is different from the existing work with a pre-defined and fixed vocabulary, and is beneficial especially when new category data are added.

Limitations. We focus on the pretraining-and-finetuning strategy to improve the absolute performance. Empirically, it remains unclear how to control the generated caption and how to perform in-context learning without parameter updates, which we leave as future work.

Societal impact. Compared with the existing work, our model clearly improves the performance and is more appropriate to help visually-impaired people. The model is pre-trained on large-scale data, and the data are not guaranteed to contain no toxic language, which may poison the output. Although we observe few such instances qualitatively, special care should be taken when deploying the model in practice, and more research is required to control the output.

Appendix

The supplementary materials provide more details on the experiments, including results with different model variants, more visualizations, ablation analysis on decoder architectures, more results on data and model scaling, etc.

Appendix A Setting

We follow Wang et al. (2021a) to preprocess the pre-training data. That is, we make sure the shorter side of the image is no larger than 384 and the longer side is no larger than 640 while maintaining the aspect ratio. Meanwhile, all images are re-saved in the JPEG format with a quality of 90. This results in 39 terabytes. No such preprocessing is applied on the fine-tuning dataset.
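The resizing rule can be illustrated by computing the target size for a given image. This is a sketch under the assumption that images already within both limits are left unchanged (never upscaled) and that the result is rounded to whole pixels; the function name is hypothetical.

```python
def target_size(width, height, max_short=384, max_long=640):
    """Largest size with short side <= max_short and long side <= max_long.

    The aspect ratio is preserved and images are never enlarged.
    """
    short_side = min(width, height)
    long_side = max(width, height)
    scale = min(1.0, max_short / short_side, max_long / long_side)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    for size in [(1000, 500), (768, 768), (300, 200)]:
        print(size, "->", target_size(*size))
```

For a 1000x500 image the longer side is the binding constraint (scale 0.64), while for a square 768x768 image the shorter-side limit applies (scale 0.5).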

A.2 Platform

The data are stored in Azure Blob Storage (https://azure.microsoft.com/en-us/services/storage/blobs), and the training is conducted on A100 GPUs provisioned by Azure Machine Learning (https://docs.microsoft.com/en-us/azure/machine-learning/). The code is in Python with packages including PyTorch (https://pytorch.org/, license: https://github.com/pytorch/pytorch/blob/master/LICENSE), DeepSpeed (https://github.com/microsoft/DeepSpeed, MIT license), Transformers (https://github.com/huggingface/transformers, Apache License 2.0), maskrcnn-benchmark (https://github.com/facebookresearch/maskrcnn-benchmark, MIT license), CLIP (https://github.com/openai/CLIP, MIT license), OSCAR (https://github.com/microsoft/Oscar, MIT license), and VirTex (Desai & Johnson, 2021) (https://github.com/kdexd/virtex, MIT license).

A.3 Network

In the main paper, we present the results of our GIT. Here, we construct two smaller model variants, named GIT$_{\text{B}}$ and GIT$_{\text{L}}$, on smaller pre-training datasets. As shown in Table 11, GIT$_{\text{B}}$ uses CLIP/ViT-B/16 (Radford et al., 2021) as the image encoder and is pre-trained on 10M image-text pairs or 4M images, which is a combination of COCO, SBU, CC3M and VG. GIT$_{\text{L}}$ uses CLIP/ViT-L/14 (Radford et al., 2021) as the image encoder and is pre-trained on 20M image-text pairs or 14M images, which is a combination of the 10M image-text pairs with CC12M.

The three model variants share the same pre-training hyperparameters. The learning rate is warmed up in the first 500 iterations, and then follows cosine decay to 0. The learning rate is $1\times 10^{-5}$ for the image encoder and is multiplied by 5 for the randomly initialized text decoder. The batch size is 4096. Parameters are updated by AdamW (Loshchilov & Hutter, 2019) with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . The number of epochs is 2.
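The schedule described above can be sketched as a function of the iteration index. The linear shape of the warmup is an assumption (the paper only says the rate is warmed up over 500 iterations), the function names are hypothetical, and the step count uses the approximate figure of 0.8B pairs purely for illustration.

```python
import math


def total_training_steps(num_pairs, epochs=2, batch_size=4096):
    """Number of optimizer steps for the given data size and batch size."""
    return num_pairs * epochs // batch_size


def learning_rate(step, total_steps, base_lr=1e-5, warmup_steps=500, multiplier=1.0):
    """Warmup (assumed linear) followed by cosine decay to 0.

    Use multiplier=5.0 for the randomly initialized text decoder.
    """
    peak = base_lr * multiplier
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps = total_training_steps(800_000_000)
    for step in (0, 499, steps // 2, steps):
        encoder_lr = learning_rate(step, steps)
        decoder_lr = learning_rate(step, steps, multiplier=5.0)
        print(step, encoder_lr, decoder_lr)
```

With 0.8B pairs, 2 epochs, and a batch size of 4096, the run amounts to roughly 390K iterations, so the 500 warmup iterations cover only a tiny fraction of training.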

As the performance exhibits no signs of plateau, we further scale up the model size to 5.1B and the number of pretraining images to 10.5B (12.9B image-text pairs). The image encoder is scaled to 4.8B based on DaViT (Ding et al., 2022) and is pre-trained with the UniCL (Yang et al., 2022a; Yuan et al., 2021) task. The text decoder is enlarged to 0.3B, the hyperparameters (number of transformer layers, hidden dimension, etc) of which follow BERT-Large (Devlin et al., 2018). The model is named as GIT2.

A.4 Implementation of the Data Loader

A challenging problem is to implement the data loader efficiently, as the total data size (39TB for the 0.8B images) is much larger than the local disk size (around 7TB). As the data are stored in Azure Storage, we download the data to the local disk before reading it rather than reading directly from the cloud. Considering that the data scale may grow even larger in the future, we should make sure each operation is independent of the dataset size. Meanwhile, the data downloading should be overlapped with the GPU computing, such that the data are always locally available when needed. The solution is outlined as follows.

The image-text pairs are evenly split among $C$ compute nodes. Each node only accesses the corresponding part.

Each node consumes the data trunk by trunk. Each trunk contains $2^{20}$ image-text pairs, except the last, which may contain fewer than $2^{20}$ pairs.

The data in each trunk are randomly shuffled. We shuffle at the trunk level so that the cost does not depend on the dataset size, and hence the approach can be applied to even larger datasets.

The shuffled trunk data are split evenly among the GPUs within the node.

One extra process on each node (launched by local rank = 0) is created to pre-fetch at most 7 future trunks. As each trunk is designed for all ranks in one node, it is not required for other ranks to launch the pre-fetching process, which avoids the race condition.

Local storage holds at most 12 trunks; when this limit is exceeded, the oldest trunk is removed.

Empirically, we observe almost no time cost on the data loading during model training, and the speed is also stable. That is, the data preprocessing is faster than the training and is overlapped with the GPU training.
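The steps above can be sketched with index arithmetic alone. The sketch below is illustrative and not the authors' implementation: all function names are hypothetical, the contiguous per-node slicing, the shared shuffle seed, and the strided per-GPU split are assumptions, and no actual download from cloud storage is performed.

```python
import random

TRUNK_SIZE = 2 ** 20  # image-text pairs per trunk (from the text)
MAX_PREFETCH = 7      # future trunks fetched ahead (from the text)
MAX_LOCAL = 12        # trunks kept on the local disk (from the text)


def node_range(num_pairs, num_nodes, node_rank):
    """Near-even slice [start, end) of the dataset owned by one node.

    Assumption: each node owns one contiguous slice.
    """
    per_node = (num_pairs + num_nodes - 1) // num_nodes
    start = min(node_rank * per_node, num_pairs)
    return start, min(start + per_node, num_pairs)


def trunk_ranges(start, end, trunk_size=TRUNK_SIZE):
    """Split a node's slice into trunks; only the last one may be short."""
    return [(s, min(s + trunk_size, end)) for s in range(start, end, trunk_size)]


def gpu_indices(trunk, local_rank, gpus_per_node, seed):
    """Shuffle one trunk and return the share of one GPU.

    Assumption: every rank uses the same seed, so all ranks agree on the
    permutation and the strided per-GPU shares are disjoint. The cost
    depends on the trunk size only, not on the dataset size.
    """
    order = list(range(*trunk))
    random.Random(seed).shuffle(order)
    return order[local_rank::gpus_per_node]


def cache_plan(current, num_trunks):
    """Trunk ids to pre-fetch and to keep locally while training on `current`.

    Intended to run only in the extra process launched by local rank 0.
    """
    prefetch = list(range(current + 1, min(current + 1 + MAX_PREFETCH, num_trunks)))
    newest = prefetch[-1] if prefetch else current
    # Keep the newest MAX_LOCAL trunks; older ones are the first to be removed.
    keep = list(range(max(0, newest - MAX_LOCAL + 1), newest + 1))
    return prefetch, keep
```

Because every function works on one node's slice or on one trunk, none of them scans the whole dataset, which matches the stated requirement that each operation be independent of the dataset size.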

Appendix B Results on Image Captioning

On each task, the model is fine-tuned for 10 epochs. The batch size is 512 and the learning rate is $2.5\times 10^{-6}$ . SCST (Rennie et al., 2017) follows the same hyperparameters if performed.

COCO. Fig. 12 shows the complete results including GIT ${}_{\text{B}}$ and GIT ${}_{\text{L}}$ on COCO Karpathy split (Karpathy & Li, 2015). For the base-sized and large-sized models, our model achieves competitive performance with existing approaches but with a simplified architecture. We observe that UniversalCaptioner (Cornia et al., 2021) achieves much better performance. As a strong image encoder of CLIP/ViT-L with 0.3B parameters is used in UniversalCaptioner for both the base and large models, the effective model size is much larger than those in the respective categories. Meanwhile, both UniversalCaptioner (Cornia et al., 2021) and OFA (Wang et al., 2022b) use more data than our approach within the base/large-sized categories. Fig. 13 shows the full results on the COCO test set.

nocaps. The main paper presents the overall performance on nocaps. Table 14 contains the complete results for each sub domain and other model variants.

Fig. 5 shows random prediction examples on the nocaps validation set (disgusting images and images containing clear people identification information are excluded). To visualize the novel concept recognition capability, we also collect sample images whose prediction contains at least one word not in the COCO training set, as illustrated in Fig. 6. As we can see, the model can identify the novel objects well without the object tags as the network input.

TextCaps. No SCST (Rennie et al., 2017) is performed. Table 15 shows full results. Fig. 7 shows predictions on random validation images. We also manually group the predictions according to different scenarios, as illustrated in Figs. 8 and 9. In Fig. 8, (1-5) show examples in which the model describes the digital time displayed on screens, which is correct most of the time. (6-10) provide examples of reading scene text in Latin (Romance) languages such as French and Spanish. (11-15) show GIT’s ability to recognize scene text in languages such as Arabic, Japanese, Korean, and Chinese. (16-20) provide examples of recognizing scene text in stylized fonts. As shown in (21-25), GIT also performs well in reading curved scene text, which is generally considered a challenging case in scene text recognition studies. In Fig. 9, samples (1-5) show examples of reading numbers on jerseys. As shown in (6-10), we observe that GIT has a strong ability to infer occluded scene text, based on both visual and text context information. For example, “blue jays” is a baseball team name in sample (6), “asahi” is a beer brand in sample (9), and the occluded letter could be letter “t” in sample (8). (11-15) provide examples of reading hand-written scene text. (16-20) demonstrate GIT’s ability to read long pieces of scene text. GIT works well in organizing scene text words into a fluent and informative sentence. (21-25) show the challenging case of describing a book page, where the model needs to recognize and select the key information to describe. For example, in sample (24), GIT covers the name and author of the book in the image.

In addition to the scene text captioning ability, we observe that the TextCaps-fine-tuned GIT is knowledgeable and can produce diverse and informative captions. We group the representative captions in Fig. 10. Samples (1-5) contain the descriptions of logos, such as “delta,” “tesla,” “oneplus,” etc. GIT also shows the capability of describing landmarks, e.g., “taj mahal,” “golden gate bridge,” “temple of heaven,” “Colosseum,” and “Sydney opera house” in (6-10). Samples (11-15) show examples on food images, such as “mapo tofu,” “pad thai,” “paella,” “beef wellington,” and “caprese salad.” (16-20) provide more examples of recognizing movie/cartoon characters and celebrities. Samples (21-25) describe products based on the tag or packaging information.

VizWiz-Captions. SCST is performed except for GIT2, and the full results are shown in Table 17. Fig. 11 visualizes the predictions on random test images. Fig. 12 groups the results by different scenarios. The model recognizes banknotes, scene text on bottles/cans, menus, screens, etc. well, and can better help vision-impaired people in real use cases. The first row (1-5) of Fig. 12 shows the generated captions on blurry images. The second row (6-10) shows images with low image quality or key information partially occluded. For example, GIT reads the scene text “metro,” “diet coke,” and “mortrin” in samples (6,9,10), and infers the object “toothpaste” and “hard drive” in samples (7,8). Samples (11-15) recognize banknotes in different currencies and denominations. (16-20) describe scene text on bottles and cans, thus providing more informative captions such as the “bacon bits” in (16) and the “nestle water” in (20). GIT also works well in summarizing menus, pages, and screens, as shown in the bottom row (21-25).

Flickr30K. Table 17 shows the full results. SCST is not applied. For the 16/32-shot setting, the batch size is reduced to 16, and the number of iterations is 100.

Appendix C Results on Visual Question Answering

Except on VizWiz-QA, the number of fine-tuning epochs is 20 and the learning rate is $1\times 10^{-5}$ . On VizWiz-QA, the number of epochs is 40 and the learning rate is $2\times 10^{-5}$ . The input size is 384 and 576 for intermediate fine-tuning and the final fine-tuning, respectively. No intermediate fine-tuning is conducted for GIT ${}_{\text{B}}$ and GIT ${}_{\text{L}}$ . Full results are shown in Table 18. Fig. 14 and Fig. 13 show correct predictions on randomly selected images of VizWiz-QA and ST-VQA, respectively. Fig. 16 and Fig. 15 show the randomly selected incorrect predictions.

Appendix D Results on Video Captioning and Question Answering

Table 21 shows the fine-tuning hyperparameters on video tasks for GIT. Table 19 and Table 20 show the complete results on video captioning and video question answering, respectively. During training, we randomly sample 6 frames at equal intervals, and apply the same random crop to these frames. During inference, we uniformly sample 6 frames with center crop.
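One way to realize the frame sampling is sketched below. The function name is hypothetical, and the exact rule is an assumption: the sketch splits the video into 6 equal intervals, draws a single random offset shared by all intervals during training (so the spacing stays equal), and uses the interval midpoint during inference. Cropping is not shown.

```python
import random


def sample_frame_indices(num_frames, num_samples=6, training=False, rng=random):
    """Pick `num_samples` equally spaced frame indices from a video."""
    interval = num_frames / num_samples
    # Training: one random offset shared by all samples keeps the spacing equal.
    # Inference: a fixed mid-interval offset gives a deterministic uniform sample.
    offset = rng.uniform(0.0, interval) if training else interval / 2.0
    return [min(num_frames - 1, int(offset + i * interval)) for i in range(num_samples)]
```

For a 60-frame video, the inference branch returns frames 5, 15, 25, 35, 45, and 55, while the training branch shifts all six indices by the same random amount.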

Appendix E Results on Image Classification

On ImageNet-1K (Deng et al., 2009), we map each label to a unique name. Each label belongs to an entry of the WordNet hierarchy and is represented with a unique offset, e.g., 2012849. Fig. 17 illustrates the Python script to generate a readable unique name given the offset. The model is fine-tuned for 10 epochs and the learning rate is $1\times 10^{-5}$ . The batch size is 4096 for the full fine-tuning and 16 for the few-shot setting. No beam search is performed during inference.

Table 22 and Table 23 show the full results with other model variants.

In the main paper, we demonstrated a decent accuracy of 88.79% top-1 on ImageNet-1k with our generative model in the full fine-tuning setting. As there is no constraint on the output, we find that only 13 (or $0.026\%$ ) predictions are outside of the 1K categories. Fig. 18 illustrates 10 samples. Although deemed incorrect, some predictions are reasonable. For example, the prediction of Fig. 18 (e) is ipad and is reasonable, although the ground-truth label is hand-held computer. These observations also imply that the generation model can quickly adapt to the classification task without pre-defining the vocabulary. Fig. 19 and Fig. 20 show the correct and incorrect predictions, respectively.

Appendix F Results on Scene Text Recognition

Table 24 shows the performance on six individual evaluation sets. Fig. 22 shows the visualization samples with our TextCaps-fine-tuned GIT (denoted as GIT ${}_{\text{TextCaps}}$ ) and with the MJ+ST-fine-tuned GIT (denoted as GIT ${}_{\text{MJSJ}}$ ). For scene text recognition, we resize the image so that the longer edge is 384 and pad the image to a square. Visually, GIT ${}_{\text{TextCaps}}$ can recognize the scene text almost as well as GIT ${}_{\text{MJSJ}}$ , but in the natural language form. GIT ${}_{\text{MJSJ}}$ can adapt well to the task and predict a clean result of the scene text. Fig. 22 shows the visualizations on all six experimented benchmarks, i.e., IC13 (Karatzas et al., 2013), SVT (Wang et al., 2011), IIIT (Mishra et al., 2012), IC15 (Karatzas et al., 2015), SVTP (Phan et al., 2013), CUTE (Risnumawan et al., 2014) from the top to the bottom row, respectively. GIT performs especially well on testing images visually similar to natural images, such as the CUTE dataset shown in the bottom row. Quantitatively, GIT achieves an even larger performance improvement of $3.9\%$ absolute accuracy on Irregular-Text CUTE80.
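The resize-and-pad step can be illustrated by computing its geometry. The helper below is a hypothetical sketch that returns only sizes and padding amounts, without touching pixel data; centering the padding is an assumption, since the text does not state on which side the padding is added.

```python
def resize_and_pad_geometry(width, height, target=384):
    """Return the resized size and the padding that yields a target x target square.

    The padding is given as (left, top, right, bottom) in pixels.
    """
    # Scale so that the longer edge becomes `target`, keeping the aspect ratio.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Assumption: the padding is split evenly between the two sides (centered).
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    padding = (pad_left, pad_top, target - new_w - pad_left, target - new_h - pad_top)
    return (new_w, new_h), padding
```

For a typical wide word crop of 200×50 pixels, the image is resized to 384×96 and padded by 144 pixels above and below.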

We also finetune the pretrained GIT on the TextOCR (Singh et al., 2021) benchmarks. As the test annotations are not publicly available, we evaluate the performance on the validation set and achieve 81.27% accuracy.

Appendix G Analysis

In the main paper, we present the impact of scaling on COCO, TextCaps and VizWiz-QA. Fig. 21 shows results on other tasks. On scene-text-related QA tasks (a) and video captioning (d)/(e)/(f), both larger model sizes and more pre-training data boost the performance significantly. For VQAv2 (b), the 0.8B data help little or even worsen the performance slightly. The task data (Goyal et al., 2017) are from COCO, and the first 20M image-text pairs are more similar to COCO images than the majority of the web crawled 0.8B data. This may indicate the first 20M image-text pairs are enough for VQAv2. For video QA (c), the improvement on more pre-training data is mild. The reason might be the domain gap between the image-text pairs and the video-question-answer triplets, which reduces the benefit of more image-text data.

G.2 Cross-attention-based decoder

We concatenate the representations of the image and the text as the input to the transformer. An alternative way is to use a cross-attention module to incorporate the image representations, as in Alayrac et al. (2022); Li et al. (2022b). The former allows the image tokens to attend to each other, which may refine the representation for better performance, while the latter isolates each image token. However, the former uses a shared set of projections for both the image tokens and the text tokens, while the latter uses separate projection matrices. A shared projection may be hard to learn effectively. Table 25 shows the comparison with different sets of pre-training image-text pairs. With a smaller dataset, the latter with cross-attention performs better, while with large-scale data, the former wins. A plausible explanation is that with more pre-training data, the parameters are well optimized such that the shared projection can adapt to both the image and the text domains, which mitigates the drawback of the shared parameters. With the self-attention, the image tokens can attend to each other for a better representation during decoding. In all experiments, we use the former architecture.
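The concatenation-based design can be illustrated by the attention mask it implies. The sketch below is an assumption-laden illustration with a hypothetical function name: it assumes that image tokens attend only to image tokens, and that each text token attends to all image tokens and, causally, to the text tokens up to and including itself, as is usual for a language modeling objective.

```python
def seq2seq_attention_mask(num_image_tokens, num_text_tokens):
    """mask[i][j] is True when token i may attend to token j.

    Tokens are ordered as [image tokens, text tokens].
    """
    n = num_image_tokens + num_text_tokens
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < num_image_tokens:
                # Every token (image or text) sees all image tokens.
                mask[i][j] = True
            elif i >= num_image_tokens:
                # Text tokens see earlier text tokens and themselves (causal).
                mask[i][j] = j <= i
    return mask
```

In a cross-attention design, by contrast, the image tokens would enter only as keys and values of a separate module and would not attend to each other inside the decoder.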

G.3 Initialization of the text decoder

We randomly initialize the decoder. One question is whether the pre-trained weights from a text corpus can give a better performance. Table 26 shows the comparison results. We study two decoder sizes: Base (following BERT ${}_{\text{B}}$ (Devlin et al., 2018) with 12 layers) and Large (following BERT ${}_{\text{L}}$ with 24 layers). As we can see, the checkpoint pretrained on the text corpus yields no improvement, or even worse performance, compared with the random initialization in both Base and Large sizes. This observation is also consistent with Table 9 of (Wang et al., 2020). The reason could be that the pretrained weights have no knowledge of the image signals, which is the key for the VL model. Another observation is that a larger decoder also exhibits no benefit, which may be attributed to the greater difficulty of effectively training a larger transformer. Here, the discussion is based on the fact that the weights are learnable, and thus may not be applicable to the case with frozen parameters, where the initial weight is important, e.g. (Alayrac et al., 2022; Zeng et al., 2022; Xie et al., 2022; Wang et al., 2022c). This is also not applicable to the case where the input is not image features, but the text, e.g. (Lin et al., 2021b).

G.4 Initialization of the image encoder

In our design, the image encoder is initialized from the contrastive pretraining. Table 27 shows the comparison with other initialization methods. The setting follows GIT ${}_{\text{B}}$ . The image encoder is the base-sized version of ViT, which is initialized from the CLIP model, from the supervised pretraining (classification task on ImageNet), from the self-supervised pretraining (MAE (He et al., 2022a) on ImageNet), or randomly initialized. From the results, we can clearly observe the higher performance with the CLIP pretrained weights. For the comparison with the supervised/self-supervised pretraining, we note that the pretraining datasets for the image encoder are different here due to the availability of these weights. Although it is unclear whether the pretraining dataset is more important than the task or vice versa, we choose the contrastive pretraining as the pretraining dataset is also easy to scale up, and the result is better. For random initialization, we observe significantly lower performance. The reason could be the small scale of the pretraining set (10M image-text pairs in the set-up of GIT ${}_{\text{B}}$ ). A larger dataset may reduce the gap, but it may require longer training iterations. We leave how to effectively train the model from scratch as future work.

G.5 Intermediate fine-tuning on VQA

For VQA, we conduct the intermediate fine-tuning by combining multiple VQA datasets. Table 28 shows the performance comparison with direct fine-tuning without the intermediate fine-tuning. From the results, we can see that the intermediate fine-tuning improves the performance for all tasks, and the improvement is larger when the target training data scale is small.

G.6 Bias study over gender and skin

Motivated by Zhao et al. (2021), we investigate the bias of our captioning model as follows. Zhao et al. (2021) provides the gender type (male or female) and the skin type (light or dark) for the COCO 2014 test images containing people. As we use the Karpathy split, we first collect the overlapped images between the Karpathy test split and the images with well-defined gender and skin annotations in Zhao et al. (2021). Then, we evaluate the performance on the subset of images of each category. To measure the bias, we calculate the normalized performance difference (NPD). Taking gender as an example, we first obtain the metric (e.g. CIDEr) on the images annotated with male ( $C_{1}$ ) and on the images annotated with female ( $C_{2}$ ). Then, NPD is $|C_{1}-C_{2}|/(C_{1}+C_{2})$ . With no bias, $C_{1}$ equals $C_{2}$ and NPD is 0. If the model performs well on one group but totally fails on the other group (metric is 0), NPD is 1. Table 29 shows the result, and we can see that the bias ranges only from 0.7% to 5.3% across all metrics.
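The NPD definition above translates directly into a small helper. The function name is hypothetical, and the scores in the usage note are made-up numbers chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def normalized_performance_difference(c1, c2):
    """NPD = |C1 - C2| / (C1 + C2) for two non-negative per-group metric values."""
    if c1 < 0 or c2 < 0:
        raise ValueError("metric values are expected to be non-negative")
    if c1 + c2 == 0:
        raise ValueError("NPD is undefined when both metric values are 0")
    return abs(c1 - c2) / (c1 + c2)
```

For instance, hypothetical scores of 110 and 90 on the two groups give an NPD of 20/200 = 0.1, equal scores give 0, and a score of 0 on one group gives 1.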

G.7 Scene text in pre-training data

We show in the main paper that a considerable number of pre-training samples contain scene text descriptions. Fig. 23 groups the pre-training samples with scene text from CC12M and the downloaded data by different scenarios. Samples (1-5) show the associated text which contains the scene text in a natural language way. This is in line with the requirement of scene-text-related tasks, e.g., TextCaps. (6-10) show examples of long pieces of text. The pre-training samples also describe scene text in stylized fonts (11-15), leading to GIT’s ability in robust scene text recognition. (16-20) contain pre-training examples with low-quality images and occluded/blurry/curved scene text. In addition to the scene text pre-training samples, pre-training datasets also contain descriptions of diverse entities, e.g., the “apple logo” in (21), banknotes in (22), celebrity “Biden” in (23), landmark “empire state building” in (24), and product “beats headphone” in (25). The pre-training data play a critical role in GIT’s capability in scene text description and informative caption generation.