PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

cs.CV cs.CL

Introduction

Increasing neural network capacity has been a successful trend in the modeling of language and vision tasks. On the language side, models such as T5 (Raffel et al., 2020), GPT-3 (Brown et al., 2020), Megatron-Turing (Shoeybi et al., 2019), GLaM (Du et al., 2022), Chinchilla (Hoffmann et al., 2022), and PaLM (Chowdhery et al., 2022) have shown significant advantages from training large Transformers on large amounts of text data. On the vision side, CNNs (Mahajan et al., 2018; Huang et al., 2019; Kolesnikov et al., 2020), Vision Transformers (Dosovitskiy et al., 2021), and other models (Tolstikhin et al., 2021; Riquelme et al., 2021) have seen similar benefits from scale (Zhai et al., 2022a), albeit to a lesser extent than in language. Language-and-vision modeling has followed a similar trend, e.g., SimVLM (Wang et al., 2021), Florence (Yuan et al., 2021), CoCa (Yu et al., 2022), GIT (Wang et al., 2022a), BEiT-3 (Wang et al., 2022c), and Flamingo (Alayrac et al., 2022).

We introduce PaLI, a model that performs image-only, language-only, and image+language tasks across many languages, using a single “image-and-text to text” interface. A key characteristic of PaLI is a more balanced parameter share between the language and vision components, with more capacity allocated to the vision backbone yielding large gains in performance. Another key ingredient of PaLI is the reuse of large unimodal backbones for language and vision modeling, in order to transfer existing capabilities and reduce training cost. On the language side, we reuse the 13B-parameter model mT5-XXL (Xue et al., 2021), which already packages language understanding and generation capabilities. We show that these capabilities are maintained and extended into a multimodal setting. On the vision side, in addition to reusing the 2B-parameter ViT-G model (Zhai et al., 2022a), we train a 4B-parameter model, which we call ViT-e (“enormous”). ViT-e achieves good performance on image-only tasks, such as 90.9% ImageNet fine-tuning, and 84.9% on ObjectNet (Barbu et al., 2019).

We find benefits from jointly scaling both the vision and the language components, with vision providing a better return on investment (accuracy improvement per parameter/FLOP). As a result, the capacity of our largest PaLI model, PaLI-17B, is distributed relatively equitably between the two modalities, with the ViT-e component accounting for about 25% of the total parameter count. This is not always the case for prior work in large-capacity vision and language modeling (Wang et al., 2022a; Alayrac et al., 2022), due to the prior scale mismatch between vision and language backbones. We enable knowledge-sharing between multiple image and/or language tasks by casting them into a generalized VQA-like task. We frame all tasks using an “image+query to answer” modeling interface, in which both the query and answer are expressed as text tokens. This allows PaLI to capitalize on transfer learning across tasks, and enhance language-and-image understanding capabilities in a wide range of vision and language problems: image captioning, visual question-answering, scene-text understanding, and others (Figure 1).

To train PaLI-17B, we build a new high-volume image-and-language dataset, WebLI, which consists of 10 billion images and tens of billions of image-text pairs. Importantly, the WebLI dataset contains text in over 100 languages. By training the model to perform multimodal tasks in many languages, we greatly increase the task diversity, and test the model’s ability to effectively scale both across tasks and across languages. As a reference for future usage, we provide a data card to report information about WebLI and its construction.

PaLI-17B achieves state-of-the-art (SOTA) results on multiple benchmarks, outperforming some strong models. Specifically, PaLI outperforms recent and concurrent models on the long-standing COCO Captioning benchmark (Chen et al., 2015), with 149.1 CIDEr score on the Karpathy split (Karpathy & Fei-Fei, 2015). PaLI also achieves a new SOTA of 84.3% on VQAv2 (Goyal et al., 2017) while using an open-vocabulary text generative setting that is similar to Flamingo (Alayrac et al., 2022). This result outperforms even models evaluated in a fixed-vocabulary classification setting, e.g. CoCa (Yu et al., 2022), SimVLM (Wang et al., 2021), BEiT-3 (Wang et al., 2022c). Last but not least, our work provides a scaling roadmap for future multimodal models. Our results support the conclusion that scaling the components of each modality yields better performance compared to more skewed alternatives. Model scaling is also important for language-image understanding in multiple languages. In summary, our contributions are the following:

We design a simple, modularized and scalable sequence-to-sequence learning architecture that can be efficiently trained by reusing existing Transformer-based unimodal checkpoints.

We perform joint scaling on both the language and vision components for a wide range of parameters, and show no saturation of performance on both components for the largest model size we consider, PaLI-17B. More importantly, we show that multimodal performance greatly benefits from scaling the vision component beyond the previous-largest ViT, which provides a scaling roadmap for future vision & language models.

We empirically validate that a mixture-of-objectives benefits the performance of large vision & language models.

We scale up pre-training data to include over 100 languages, and train a large-capacity multilingual multimodal model. We show that a properly-scaled model can handle a large number of languages well, while still achieving SOTA performance on English-only tasks.

Related Work

Pre-trained models have proven effective in both vision (Dosovitskiy et al., 2021; Zhai et al., 2022a) and language (Raffel et al., 2020; Brown et al., 2020) tasks. Image-text pre-training has also become the default approach to tackle V&L tasks (Tan & Bansal, 2019; Chen et al., 2020; Zhang et al., 2021; Cho et al., 2021; Hu et al., 2022). While benefiting from the text representation and generation capabilities of the Transformer architecture, some of these vision-language models rely on external systems (such as Fast(er) R-CNN (Ren et al., 2015)) to provide detected object names and the related precomputed dense features. Such reliance limited the capability to scale up the model and performance. With the introduction of Vision Transformers (Dosovitskiy et al., 2021), vision and language modalities can be jointly modeled by transformers in a more scalable fashion (Yuan et al., 2021; Yu et al., 2022; Wang et al., 2022a; Alayrac et al., 2022).

One approach for image-text pre-training is contrastive learning (Radford et al., 2021; Jia et al., 2021). Zhai et al. (2022b) show that with a pre-trained and locked vision model, one needs to train only a paired text encoder model to get good language embeddings. Yuan et al. (2021) extend contrastively pre-trained models to more downstream tasks with task-specific adaptations. Besides image and language, MERLOT (Zellers et al., 2021) has found success in video understanding and reasoning through video-language pretraining. Another approach is to train vision-language models to generate text autoregressively (Donahue et al., 2015; Vinyals et al., 2015). This approach has the advantage of a unified formulation of vision-language tasks as a text generation problem (Cho et al., 2021; Wang et al., 2022b; Piergiovanni et al., 2022b). In Cho et al. (2021), the vision-language model is trained to recover masked text. SimVLM (Wang et al., 2021) proposes an image-language pre-training approach leveraging a prefix language modeling objective. The unified framework OFA (Wang et al., 2022b) extends the generation capability to include text-to-image generation. Concurrent with our work, Unified-IO (Lu et al., 2022) further scaled up the number of objectives and tasks and demonstrated decent performance across the board through only multi-task pre-training without task-specific fine-tuning.

Recent works explore joint vision and language modeling with increased model capacity. CoCa (Yu et al., 2022) pre-trains a 2.1B image-text encoder-decoder model jointly with contrastive loss and generative loss. GIT (Wang et al., 2022a) trains a model consisting of a single image encoder and a text decoder with a captioning (generative) loss, where the image encoder is pre-trained with contrastive loss. In their latest version, GIT2, the model size is scaled up to 5.1B, with the majority of parameters on the vision side (4.8B). BEiT-3 (Wang et al., 2022c) presents an architecture with vision, language, and vision-language experts, operating with a shared multi-head self-attention followed by a switch for “expert” modules, resulting in a 1.9B model trained from scratch on a variety of public image, text and image-text datasets. Flamingo (Alayrac et al., 2022) is built upon a 70B language model (Hoffmann et al., 2022) as a decoder-only model whose majority of parameters are frozen in order to preserve language-generation capabilities, along with a 435M vision encoder.

Vision-language pre-training also benefits from automatically mined and filtered large-scale datasets such as Conceptual Captions (CC3M) and CC12M (Sharma et al., 2018; Changpinyo et al., 2021), with 3 and 12 million image-text pairs, respectively. With more relaxed filtering, LEMON (Hu et al., 2022) collected a larger dataset with 200M examples, which is further expanded to 800M examples in GIT (Wang et al., 2022a). To better scale the model, larger, noisier datasets such as the ALIGN dataset (1.8B) (Jia et al., 2021) have been constructed, which have benefited SimVLM (Wang et al., 2021) and CoCa (Yu et al., 2022). While these image-text datasets have fueled the foundational V&L models with state-of-the-art performance, they are English-only, and there have been limited attempts to create datasets that are not English-centric and to unlock the multilingual capability of these models.

The PaLI Model

With PaLI, we aim to perform both unimodal (language, vision) and multimodal (language and vision) tasks. Typically, many of these tasks are best handled by different models. For instance, image classification, and many formulations of VQA, require predicting elements from a fixed set, while language-only tasks and image captioning require open-vocabulary text generation. Similar to the recent work OFA (Wang et al., 2022b) and a concurrent work (Lu et al., 2022), we resolve this by using a sufficiently general interface for all tasks considered: the model accepts as input an image and a text string, and generates text as output. The same interface is used both during pre-training and fine-tuning. Since all tasks are performed with the same model, i.e., we have no task-specific parameters or “heads”, we use text-based prompts to indicate to the model which task to perform.
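As a rough illustration of this shared interface, the sketch below casts two different tasks into the same (image, prompt) to text format. The `Example` class, the `make_example` helper, and the prompt templates are hypothetical and chosen only for illustration; they are not the prompts used to train PaLI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    image: Optional[str]  # image path or id; None for text-only tasks
    prompt: str           # text input that tells the model which task to run
    target: str           # text the model is trained to generate


# Hypothetical prompt templates, for illustration only.
TEMPLATES = {
    "caption": "Describe the image in {lang}",
    "vqa": "Answer in {lang}: {question}",
}


def make_example(task, target, image=None, **fields):
    """Cast one task instance into the shared (image, prompt) -> text format."""
    prompt = TEMPLATES[task].format(**fields)
    return Example(image=image, prompt=prompt, target=target)


cap = make_example("caption", "a dog on a beach", image="img_001.jpg", lang="EN")
vqa = make_example("vqa", "two", image="img_002.jpg", lang="EN",
                   question="How many cats?")
print(cap.prompt)
print(vqa.prompt)
```

Because every task reduces to the same three fields, no task-specific output head is needed; only the prompt text changes.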

Figure 1 shows a high-level schematic of the model architecture. At its core, PaLI has a text encoder-decoder Transformer (Vaswani et al., 2017). To include vision as input, the text encoder is fed with a sequence of visual “tokens”: output patch features of a Vision Transformer that takes an image as input. No pooling is applied to the output of the Vision Transformer before passing the visual tokens to the encoder-decoder model via cross-attention. We reuse previously trained unimodal checkpoints. For the text encoder-decoder, we reuse pre-trained mT5 (Xue et al., 2021) models, while for the image encoder, we reuse large vanilla ViT models (Dosovitskiy et al., 2021; Zhai et al., 2022a).
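Since no pooling is applied, the number of visual tokens grows with the square of the image side length. The sketch below works this out for the resolutions mentioned in this paper. The patch size of 14 is an assumption (it is not stated in this section), and `num_visual_tokens` is a hypothetical helper.

```python
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT emits for a square image, without pooling.

    patch_size=14 is an assumption made for this sketch.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side * side


for res in (224, 490, 588):
    print(res, num_visual_tokens(res))
```

Under this assumption, moving from 224 to 588 pixels per side multiplies the visual sequence length by roughly seven, which is one reason high-resolution training is restricted to a short final phase.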

The visual component. We introduce and train the largest vanilla ViT architecture to date, named ViT-e. ViT-e has the same architecture and uses the same training recipe as the 1.8B parameter ViT-G model (Zhai et al., 2022a), while scaling to 4B parameters. The only other difference is that we apply learning rate cool-down twice, once with and once without inception crop augmentation, and average (“soup”) the weights of the two models as in Wortsman et al. (2022). While the scaling laws have been studied in both the vision domain and the language domain, scaling behaviour is less explored in combined vision and language models. Scaling up vision backbones leads to saturating gains on classification tasks such as ImageNet (Zhai et al., 2022a). We further confirm this, observing that ViT-e is only marginally better than ViT-G on ImageNet (Table 17). However, we observe substantial performance improvements from ViT-e on vision-language tasks in PaLI (Section 4). For example, ViT-e yields almost three additional CIDEr points over ViT-G on the COCO captioning task. This hints towards future headroom for vision-language tasks with even larger ViT backbones.

We adopt the mT5 (Xue et al., 2021) backbone as our language component. We experiment using the pre-trained mT5-Large (1B parameters) and mT5-XXL (13B parameters), from which we initialize the language encoder-decoder of PaLI. We train on a mix of many tasks, including pure language understanding tasks (Section A.2). This helps avoid catastrophic forgetting of mT5’s language understanding and generation abilities. As a result, PaLI-17B continues to achieve similar levels of language-understanding accuracy on both the English benchmarks (Wang et al., 2019a) and across languages measured by the XTREME benchmark (Hu et al., 2020) (Section 4).

The overall model

Three model sizes are considered (Table 7): 1) PaLI-3B, where the language component is initialized from mT5-Large (Xue et al., 2021) (1B parameters), and the vision component is ViT-G (Zhai et al., 2022a) (1.8B parameters). 2) PaLI-15B, where the language component is initialized from mT5-XXL (Xue et al., 2021) (13B parameters), and the vision component is ViT-G (1.8B parameters). 3) PaLI-17B, where the language model is initialized from mT5-XXL, and the vision component is the newly-trained ViT-e model (4B parameters).
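The split of capacity between the two modalities can be checked with simple arithmetic on the rounded parameter counts quoted above. The sketch below is only a back-of-the-envelope calculation: the `MODELS` table and `vision_share` helper are illustrative names, and the rounded counts ignore any parameters outside the two backbones.

```python
# Rounded parameter counts in billions, as quoted in the text.
MODELS = {
    "PaLI-3B": {"language": 1.0, "vision": 1.8},
    "PaLI-15B": {"language": 13.0, "vision": 1.8},
    "PaLI-17B": {"language": 13.0, "vision": 4.0},
}


def vision_share(parts):
    """Fraction of the listed parameters that belong to the vision backbone."""
    return parts["vision"] / (parts["language"] + parts["vision"])


for name, parts in MODELS.items():
    total = parts["language"] + parts["vision"]
    print(f"{name}: {total:.1f}B listed, vision share {vision_share(parts):.1%}")
```

With these rounded figures, the vision share of PaLI-17B comes out a little under one quarter, in line with the "about 25%" stated in the introduction, while PaLI-15B devotes only around one eighth of its capacity to vision.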

3.2 Data

Scaling studies for deep learning show that larger models require larger datasets to train effectively (Hoffmann et al., 2022; Kaplan et al., 2020; Zhai et al., 2022a). To unlock the potential of multilingual image-language pre-training, we introduce WebLI, a multilingual image-language dataset built from images and texts available on the public web. WebLI scales up the image-language data collection from English-only datasets to 109 languages, which enables us to pre-train PaLI multilingually, and perform downstream tasks across many languages. The data collection process is similar to that reported by Jia et al. (2021) and Zhai et al. (2022b). Due to the abundance of multilingual content on the internet, the collection process for the WebLI dataset can be scaled to cover 10 billion images and 12 billion alt-texts. In addition to annotation with web text, we use a publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs. To balance quality and scale, we filter the dataset down to its highest-quality subset, retaining only the top-scoring 10% of the original WebLI image-text pairs (about 1B examples), which we use to train PaLI. Examples and statistics for the WebLI corpus and a complete datasheet (Pushkarna et al., 2022) are shown in Appendix B (Figure 2) and G.
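The quality filter described above amounts to ranking image-text pairs by a score and keeping the top tenth. A minimal sketch follows; the scoring model is not described in this section, so the scores here are placeholder numbers and `keep_top_fraction` is a hypothetical helper.

```python
def keep_top_fraction(pairs, fraction=0.10):
    """Keep the highest-scoring fraction of (example_id, score) pairs."""
    k = max(1, round(len(pairs) * fraction))
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder scores: example i gets score i / 20.
pairs = [(i, i / 20) for i in range(20)]
kept = keep_top_fraction(pairs)
print([example_id for example_id, _ in kept])
```

At WebLI scale a full sort would be replaced by a score threshold estimated from a sample, but the selection rule is the same.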

Training mixture

We specify each task using a training data source and a template-based prompt, and train the model using a language-model–style teacher forcing (Goodfellow et al., 2016) with a standard softmax cross-entropy loss. The coefficients for the training mixture are empirically determined, with 1.6B total examples in the mixture (Appendix A.2). The whole mixture is slightly smaller and designed to be cleaner than the datasets used in SimVLM (1.8B), CoCa (1.8B), and Flamingo (2.3B). However, unlike the aforementioned datasets, examples in our 1.6B dataset follow a long-tailed distribution over the 100+ languages covered. To prevent leakage between the pre-training examples and the downstream benchmarks, WebLI has undergone near de-duplication (Jia et al., 2021) of the images against the train, validation, and test splits of 68 common vision/vision-language datasets. For other datasets in the mixture, we performed the same de-duplication against all the downstream tasks.
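The training objective named above, teacher forcing with softmax cross-entropy, can be sketched in a few lines of plain Python. This is an illustrative toy, not the training code: the function names are hypothetical, and averaging the loss over target tokens is an assumption of this sketch.

```python
import math


def softmax_xent(logits, target_id):
    """Cross-entropy of one softmax prediction against one target token id."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def teacher_forced_loss(step_logits, target_ids):
    """Mean per-token loss under teacher forcing.

    step_logits[t] stands for the decoder output at position t when the
    ground-truth prefix target_ids[:t] (not the model's own samples) is fed in.
    """
    losses = [softmax_xent(l, t) for l, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


# Uniform logits over a 4-token vocabulary give a loss of log(4) per token.
loss = teacher_forced_loss([[0.0] * 4, [1.0] * 4], [2, 0])
print(loss)
```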

3.3 Model Training

All PaLI variants are trained for one epoch over the entire pre-training dataset (1.6B) with 224 $\times$ 224 image resolution. Only the parameters of the language component are updated; the vision component is frozen, which is beneficial (Sec. 4.6). For the largest model, PaLI-17B, we perform an additional high-res (588 $\times$ 588) phase similar to previous works (Radford et al., 2021; Yuan et al., 2021; Yu et al., 2022). This phase runs for only 10k steps, covering 10M examples in total, with all the parameters of PaLI updated. More details for training PaLI and the ViT-e backbone are in Appendix A.1.

Experiments

We fine-tune and evaluate PaLI-3B and PaLI-15B checkpoints at 490 $\times$ 490 resolution. For PaLI-17B, unless otherwise stated, the checkpoint produced by the two-phase pre-training is fine-tuned and evaluated at 588 $\times$ 588 resolution. For all the benchmarks, cross-entropy loss is used for fine-tuning.

We fine-tune on COCO Captions (Chen et al., 2015) on the widely adopted Karpathy split (Karpathy & Fei-Fei, 2015). PaLI outperforms the latest SOTA trained with cross-entropy loss (Wang et al., 2022c), and establishes a new high CIDEr score (Vedantam et al., 2015) of 149.1 (Table 1) for models without CIDEr optimization.

NoCaps (Agrawal et al., 2019) is an evaluation benchmark for image captioning that has a similar style to COCO, but targets many more visual concepts than those included in COCO. We follow previous works by evaluating NoCaps using a model fine-tuned on COCO. PaLI-17B achieves a 124.4 CIDEr score on test, comparable to the recent result of 124.8 from GIT2 (Wang et al., 2022a). GIT2 achieves 124.2, 125.5, 122.3 on the in-domain, near-domain, and out-of-domain splits of the NoCaps test set, respectively. PaLI-17B achieves 121.1, 124.4 and 126.7, respectively. This suggests that for PaLI-17B, the domain transfer from COCO to NoCaps is slightly sub-optimal compared with models pre-trained with English only. Nevertheless, PaLI-17B outperforms all prior models on recognizing and describing long-tail objects outside of COCO’s domain.

TextCaps (Sidorov et al., 2020) focuses on captioning for images containing text. VizWiz-Cap (Gurari et al., 2020) contains images taken by people who are blind, which also involves scene-text understanding. We fine-tune on TextCaps and VizWiz-Cap using OCR strings generated by a publicly available automatic service, similar to the protocol used by Yang et al. (2021). Further details, including results evaluating PaLI-17B without OCR as input, are provided in Appendix C.5.

Following Thapliyal et al. (2022), we fine-tune PaLI models on COCO-35L, which is COCO captions translated into 35 languages similar to CC3M-35L, before evaluating on Crossmodal-3600. We used the checkpoints pre-trained at 224 $\times$ 224 resolution and fine-tuned on COCO-35L at the same resolution. We normalize the Unicode, tokenize, and remove all punctuation before calculating CIDEr scores. For languages without word boundaries such as Chinese, Japanese, Korean and Thai, a neural model is used for segmenting the text. To illustrate the range of improvements over a variety of language families with different scripts and different resources, we use seven languages in Table 2 to show their exact CIDEr scores, in addition to the 35-language average score. PaLI outperforms previous SOTA by large margins. Note that due to different linguistic structures, the variance of CIDEr scores across different languages does not indicate lower quality of prediction on certain languages. In Appendix C.2, we back-translate the non-English predictions to English, and demonstrate that the capability of PaLI on both English and other languages is rather consistent.
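The text normalization applied before CIDEr scoring (normalize Unicode, tokenize, remove punctuation) might look roughly like the sketch below. The choice of NFKC normalization and whitespace tokenization are assumptions of this sketch, and `preprocess_caption` is a hypothetical helper; the neural segmenter used for languages without word boundaries is not reproduced here.

```python
import unicodedata


def preprocess_caption(text: str) -> list:
    """Normalize Unicode, strip punctuation, and split on whitespace.

    NFKC and whitespace tokenization are assumptions of this sketch.
    """
    text = unicodedata.normalize("NFKC", text)
    # Unicode general categories starting with "P" are punctuation.
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return text.split()


print(preprocess_caption("A dog, on the beach!"))
```

Whitespace splitting only makes sense for languages that mark word boundaries; for Chinese, Japanese, Korean, and Thai the `split()` call would be replaced by a segmentation model.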

4.2 Visual Question Answering

All the VQA fine-tuning experiments in this paper are performed in the open-vocabulary setting using the 250k mT5 (Xue et al., 2021) vocabulary (Table 3). Most prior works, e.g. SimVLM (Wang et al., 2021), CoCa (Yu et al., 2022) and BEiT-3 (Wang et al., 2022c), use the VQA-as-classification setting, where the best answer among a predefined set (usually of size 3k) needs to be selected. Note that the VQA-as-open-generation setting is challenging because: (1) The generated text is directly compared to the desired answer and only an exact match is counted as accurate. (2) The PaLI vocabulary covers 100+ languages and is significantly larger than both those used in the classification setting, and those used by previous single-language open-generation models (Alayrac et al., 2022; Wang et al., 2022a).

On VQAv2, PaLI achieves 84.3 accuracy and outperforms the previous SOTA as follows: (1) By +2.2 accuracy points on the open-vocabulary generation setting, compared to Flamingo (Alayrac et al., 2022). (2) By +0.3 accuracy points when compared against the best result on the closed-vocabulary classification setting, BEiT-3 (Wang et al., 2022c).

OKVQA requires external knowledge to answer its questions, that is, knowledge that is not directly present in the image input and instead must be indirectly inferred by the model. PaLI-17B achieves 64.5 accuracy, pushing SOTA for the pretrain-finetune setup higher by 10.1 accuracy points, compared to KAT (Gui et al., 2021) at 54.4 accuracy. The best result for the 32-shot learning setup is from Flamingo (Alayrac et al., 2022) at 57.8 accuracy. The results from Flamingo and PaLI-17B suggest that leveraging external knowledge does not necessarily require specific training, and instead can be achieved with generic large-capacity models trained on large amounts of data.

TextVQA (Singh et al., 2019), VizWiz-QA (Gurari et al., 2018) and ST-VQA (Biten et al., 2019) require the ability to perform question answering in the presence of text in the input image. We fine-tune using OCR strings generated by a publicly available automatic service, similar to the protocol in TAP (Yang et al., 2021) and Mia (Qiao et al., 2021). Evaluation on TextVQA and VizWiz-QA without OCR as input is provided in Appendix C.5.

Cross-lingual and Multilingual VQA on xGQA and MaXM. Both xGQA (Pfeiffer et al., 2022) and MaXM (Changpinyo et al., 2022b) are test-only VQA benchmarks that require multilingual understanding of visual questions. The setting in xGQA is cross-lingual (English-answers only), whereas for MaXM it is multilingual (answer in the same language as the question). We evaluate PaLI-17B pre-trained at 224 image resolution and fine-tuned on the native and translated VQAv2 (Goyal et al., 2017) (the Karpathy train split) in the 13 languages covered by xGQA and MaXM (VQAv2-13L) at 378 resolution. Table 4 shows significant gains on both benchmarks across all languages.

4.3 Language-understanding Capabilities

Since PaLI is pre-trained with a diverse mixture of multimodal tasks with image and text data, it raises the question of whether it would “forget” its language modeling capability, and therefore exhibit inferior performance on language-understanding tasks compared to its unimodal starting checkpoint (mT5-XXL in the case of PaLI-17B). Therefore, we compare mT5-XXL and PaLI-17B on a range of language understanding benchmarks, including the English-only SuperGLUE benchmark (Wang et al., 2019a), as well as three multilingual benchmarks from XTREME (Hu et al., 2020): XNLI (Conneau et al., 2018), which is a textual entailment task covering 14 languages, XQuAD (Artetxe et al., 2020) and TyDiQA-GoldP (Clark et al., 2020), which are both question-answering tasks covering 10 and 11 languages, respectively. For the three XTREME benchmarks, we evaluate in the zero-shot (ZS) transfer setting, whereas for SuperGLUE the models are fine-tuned (FT). Table 11 in Appendix C.1 summarizes the results. Despite the pre-training mixture heavily favoring the V&L tasks, PaLI-17B is able to maintain a high level of language-understanding capabilities for English, and it is on par with the state-of-the-art mT5-XXL checkpoint on the XTREME benchmarks.

4.4 Zero-shot Image Classification

We evaluate the PaLI checkpoints (without high-res phase) at 224 $\times$ 224 resolution on ImageNet and ImageNet OOD evaluation sets: ImageNet (Deng et al., 2009), ImageNet-R (Hendrycks et al., 2021a), ImageNet-A (Hendrycks et al., 2021b), ImageNet-Sketch (Wang et al., 2019b), ImageNet-v2 (Recht et al., 2019) and ObjectNet (Barbu et al., 2019). We use the same interface as for all other tasks. Instead of training a classifier on top of PaLI, we condition on the image and use PaLI’s decoder to score strings corresponding to each class directly (see Appendix C.8 for details). The top-1 accuracies are presented in Table 5, which clearly shows that PaLI-17B is significantly better than the smaller variants. We are not aware of any previous work for large scale zero-shot evaluation on ImageNet with a generative model. However, PaLI with a zero-shot setting outperforms the 1-shot learning result from Flamingo (Alayrac et al., 2022).
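Classification by scoring class strings can be sketched as an argmax over candidate strings. In the sketch below, `score_fn` is a hypothetical stand-in for the decoder's log-likelihood of a candidate string given the image and prompt; the toy version simply looks the score up in a dictionary so the example runs without a model.

```python
def classify(image, class_names, score_fn, prompt=""):
    """Return the class whose string receives the highest score."""
    scores = {name: score_fn(image, prompt, name) for name in class_names}
    return max(scores, key=scores.get)


def toy_score_fn(image, prompt, candidate):
    # Stand-in for a decoder log-likelihood: here the "image" is just a
    # dict of made-up per-class scores, used only to make the sketch runnable.
    return image.get(candidate, float("-inf"))


fake_image = {"cat": -1.2, "dog": -0.3}
print(classify(fake_image, ["cat", "dog", "car"], toy_score_fn))
```

The cost of this procedure is one decoder scoring pass per class name, so it scales linearly with the size of the label set.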

4.5 Model Scaling

Due to the modular architecture, the image and language components of PaLI can be scaled independently. We demonstrate that jointly scaling the capacity of both components leads to performance improvements. Figure 2 quantifies this improvement across seven V&L benchmarks where we have also evaluated the PaLI-17B checkpoint without the high resolution pre-training phase for fair comparison. These improvements are noticeable both when scaling the language-model capacity (from L to XXL), and the vision-model capacity (from ViT-G to ViT-e). Figure 2 also shows that scaling the visual component is important: when scaling from a ViT-G to a ViT-e model, although the overall model size is increased by only about 13% (+2B parameters), the average performance improvement over all seven benchmarks (additional +3.2) is larger than the one obtained with much larger increases in the capacity of the language model (+3.1) which takes more parameters (+12B). The high-resolution pre-training phase at 588 $\times$ 588 resolution brings an additional +2.0 points, which also indicates the potential of scaling up the vision component of the model. This observation also resonates with the significant improvement from PaLI-15B to 17B on generative ImageNet zero-shot classification (Table 5). Table 12 shows the results of a 5B version of PaLI with mT5-L and ViT-e on two benchmarks, which also resonates with the finding of the benefit of joint scaling. For context, in prior work, V&L scaling is usually conducted at lower model capacity: for instance, CoCa (Yu et al., 2022) scales up to 2.1B parameters, or scaling is done primarily via the language-modeling backbone, e.g. Flamingo (Alayrac et al., 2022) scales the text backbone to 80B but the image backbone remains at 435M. Finally, on the Crossmodal-3600 benchmark, we show that scale has a large impact on multilingual performance as well (Figure 5 in the Appendix).
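The "return on investment" comparison above can be made concrete by dividing each average gain by the number of parameters added. The sketch below uses only the figures quoted in the paragraph; the dictionary layout and `gain_per_billion` helper are illustrative.

```python
# Average gain over the seven benchmarks and parameters added (in billions),
# as quoted in the text.
SCALING_STEPS = {
    "vision: ViT-G -> ViT-e": {"added_params_b": 2.0, "avg_gain": 3.2},
    "language: mT5-L -> mT5-XXL": {"added_params_b": 12.0, "avg_gain": 3.1},
}


def gain_per_billion(step):
    """Average benchmark points gained per billion parameters added."""
    return step["avg_gain"] / step["added_params_b"]


for name, step in SCALING_STEPS.items():
    print(f"{name}: {gain_per_billion(step):.2f} points per 1B parameters")
```

On these numbers, a parameter added to the vision backbone buys roughly six times the average improvement of a parameter added to the language backbone.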

4.6 Ablation studies

We examine the composition of the task mixture and demonstrate the effectiveness of our multiple-objective mixture design. To this end, we pre-train a PaLI-3B model with 200M data coverage for each setting, before fine-tuning on a combination of English and multilingual V&L tasks (Table 6). Aside from the four tasks from our main evaluation for PaLI, we also add a VQAv2-based VQG benchmark (Akula et al., 2021). The relative weight of each component remains the same as in the full mixture (Table 9). As a first observation, the split-cap objective on WebLI appears to be the most critical, across all benchmarks. Second, the object-related components also boost performance on all benchmarks. Third, the captioning objective on CC3M-35L helps on COCO; on XM-3600, its positive contribution for non-EN languages and the slight degradation for English is a reflection of CC3M-35L having a much higher non-EN example ratio (34/35) compared to WebLI alt-text (60% English, Figure 2). Fourth, adding VQA helps TextVQA; in addition, the VQG objective improves the model’s VQG capability without impacting the performance on other benchmarks. Last but not least, the OCR objective positively impacts OCR-related tasks such as TextVQA, at a slight negative impact on captioning performance. We also note that VQAv2, due to its large training set size, is much less sensitive to the change in pre-training mixture. In addition, we perform ablations to quantify the positive impact of initializing from uni-modal checkpoints, as opposed to from-scratch training (Table 14); the minor accuracy improvement from freezing the ViT backbone during pre-training (Table 15); and the effect of pretraining with non-English WebLI examples on multi-(cross-)lingual performance (Table 16).

Ethics statement and broader impacts

Large models may have broader societal impact. While such models have demonstrated strong performance on public benchmarks, they might contain unknown biases or stereotypes, or propagate inaccurate or otherwise distorted information. While we have made efforts to measure some of these issues, such models need to be re-assessed carefully before being used for specific purposes. The dataset used for pre-training is automatically harvested, and filtering of the data is automatic. That process may leave undesirable images or text annotations, descriptions or concepts to be incorporated into the model. We have also attempted to train the model to operate in more than 100 languages, which we believe is an important step forward for image-language models. However, languages have various levels of data presence and coverage, so the generated text varies in quality depending on the language, and might contain inaccurate or undesirable outputs.

Reproducibility statements

Our model is based on open-source components: ViT and mT5 (Dosovitskiy et al., 2021; Xue et al., 2021). Model architecture details for each component are in Section 3.1. The configuration of ViT-e when scaling is provided in Table 7 and Section A.1. We have provided training and fine-tuning details in Section 3.3 and in Section A in the Appendix. Data and model cards are also provided in the Appendix.

Acknowledgements

We would like to thank Erica Moreira, Victor Gomes, Tom Small, Sarah Laszlo, Kathy Meier-Hellstern, Susanna Ricco, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Rich Lee, Austin Tarango, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, Maysam Moussalem, Jeremiah Harmsen, Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, and Zoubin Ghahramani for helpful discussions, feedback, and support.

Appendix A PaLI model additional information

Figure 3 visualizes some examples of PaLI on several tasks, such as image captioning, visual question answering, OCR-oriented captioning and question answering. Examples in multiple languages are shown as well.

Below, we show more specifics about the PaLI model and its components.

Table 7 lists the main PaLI models used, the largest of which is PaLI-17B with 17B parameters.

ViT-e Backbone

We show ViT-e’s configuration in Table 8 alongside ViT-g and ViT-G for reference. Width, depth and MLP dimensions are all further scaled up in ViT-e, resulting in a model with 4B parameters. The model training setup is copied from the ViT-G model (Zhai et al., 2022a), on the JFT-3B dataset (Zhai et al., 2022a), with a batch size of $16,384$ and a resolution of 224 $\times$ 224. We train the model for 1M steps using an initial learning rate of 0.0008, with an inverse square-root learning rate decay, and a linear cool-down to zero for the final 100k steps. The only added technique is model souping (Wortsman et al., 2022): we run the cool-down from 900k to 1M steps twice, once with inception cropping and once with resizing only. The final ViT-e model thus consists of the average of the weights from these two cool-downs. ViT-e is pretrained using the big_vision codebase (Beyer et al., 2022).
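The weight averaging behind model souping can be illustrated with a minimal sketch. It assumes each checkpoint is a plain dictionary mapping parameter names to flat lists of floats; the function name `average_checkpoints` and the toy checkpoints are hypothetical and are not taken from the big_vision codebase.

```python
def average_checkpoints(checkpoints):
    """Uniformly average checkpoints given as {param_name: [float, ...]} dicts."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    souped = {}
    for name in checkpoints[0]:
        # Collect the same parameter from every checkpoint and average elementwise.
        tensors = [ckpt[name] for ckpt in checkpoints]
        souped[name] = [sum(values) / len(values) for values in zip(*tensors)]
    return souped


# Toy stand-ins for the two cool-down runs (inception crop vs. resize only).
crop_run = {"w": [1.0, 2.0], "b": [0.0]}
resize_run = {"w": [3.0, 4.0], "b": [1.0]}
soup = average_checkpoints([crop_run, resize_run])
print(soup)
```

With two checkpoints this reduces to the midpoint of the two sets of weights, which matches the description of averaging the two cool-downs.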

The overall model

The overall PaLI models are implemented in JAX/Flax (Bradbury et al., 2018) using the open-source T5X (Roberts et al., 2022) and Flaxformer (Heek et al., 2020) frameworks. For the learning rate, we use a 1k-step linear warmup, followed by inverse square-root decay. For PaLI-3B, we use a peak learning rate of 1e-2. For larger models, PaLI-15B and PaLI-17B, we use a peak learning rate of 5e-3. We use the Adafactor (Shazeer & Stern, 2018) optimizer with $\beta_{1}=0$ and second-moment exponential decay set to 0.8.
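As an illustration of the schedule described above, the following sketch implements a linear warmup followed by inverse square-root decay. The exact parameterization is an assumption: here the decay is normalized so that the rate equals the peak value at the end of warmup, which may differ from the T5X implementation. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, peak_lr, warmup_steps=1000):
    """Linear warmup to peak_lr, then inverse square-root decay (assumed form)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Normalized so that the rate is exactly peak_lr at step == warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)


# Peak learning rates taken from the text: 1e-2 for the 3B model, 5e-3 for 15B and 17B.
for step in (0, 500, 1000, 4000, 100000):
    print(step, learning_rate(step, 1e-2), learning_rate(step, 5e-3))
```

Under this form the rate halves every time the step count quadruples after warmup.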

The largest model, PaLI-17B, is pretrained using 1,024 GCP-TPUv4 chips for 7 days. It uses a four-way model partitioning (Roberts et al., 2022) and a batch size of 4,096. This is slightly fewer TPU resources than were used to train other large vision and language models on TPUs. SimVLM used 2,048 GCP-TPUv3 chips for 5 days (Wang et al., 2021), while CoCa used 2,048 GCP-TPUv4 chips for 5 days (Yu et al., 2022). Flamingo used 1,536 GCP-TPUv4 chips for 15 days (Alayrac et al., 2022).

During training, the model passes over 1.6B images, one epoch over the entire pretraining dataset. The image resolution for this pass is 224 $\times$ 224. During training, only the parameters of the language component are updated and the vision component is frozen, which provides a boost in performance (Sec. 4.6).
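The compute and data figures quoted in the two paragraphs above can be put side by side with a small calculation. This is only back-of-the-envelope arithmetic on the numbers stated in the text: chip-days ignore differences between TPU generations (SimVLM used TPUv3), and the step count treats "1.6B images" as an exact figure, which is an assumption.

```python
# (TPU generation, number of chips, days) as quoted in the text.
runs = {
    "PaLI-17B": ("TPUv4", 1024, 7),
    "SimVLM": ("TPUv3", 2048, 5),
    "CoCa": ("TPUv4", 2048, 5),
    "Flamingo": ("TPUv4", 1536, 15),
}

# Chip-days are a coarse proxy for compute; they are not comparable across generations.
chip_days = {name: chips * days for name, (_, chips, days) in runs.items()}
for name, total in chip_days.items():
    print(f"{name}: {total} chip-days on {runs[name][0]}")

# Approximate number of optimizer steps for one epoch at the stated batch size.
images_seen = 1_600_000_000
batch_size = 4096
steps_per_epoch = images_seen // batch_size
print("steps for one epoch:", steps_per_epoch)
```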

A.2 The Pretraining Task Mixture

Below are detailed descriptions of each component of our task mixture.

Object detection is a generative object-detection task inspired by Chen et al. (2021; 2022). The target sequence describes bounding-box coordinates and object labels, e.g., "10 20 90 100 cat 20 30 100 100 dog". The coordinates are in the ymin xmin ymax xmax order, and range between 0 and 999. Unlike Chen et al. (2021), the prompt used contains a set of positive and negative class labels, i.e., object classes that are present and not present in the image (e.g., "detect cat and dog and leopard"). The prompt is prefixed with the word "detect". For the datasets that do not have negative class labels explicitly defined, we randomly sample non-positive class labels. Since WebLI does not contain bounding box annotations, we train on a mixture of public datasets, totalling 16M images: Open Images (Kuznetsova et al., 2020), Visual Genome (Krishna et al., 2017), and Objects365 (Shao et al., 2019). The datasets are de-duplicated against evaluation tasks. These examples are included to increase the object awareness capabilities of the model.
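The prompt and target formats described above can be sketched as follows. The helper names (`quantize`, `detection_target`, `detection_prompt`) are hypothetical, and the binning rule that maps pixel coordinates to integers in 0..999 is an assumption; the text specifies only the coordinate order and range. The order of positive and negative labels in the prompt is likewise an assumption.

```python
def quantize(value, size, bins=1000):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, int(value * bins // size)))


def detection_target(boxes, height, width):
    """boxes: list of (ymin, xmin, ymax, xmax, label) in pixel coordinates."""
    parts = []
    for ymin, xmin, ymax, xmax, label in boxes:
        coords = [
            quantize(ymin, height),
            quantize(xmin, width),
            quantize(ymax, height),
            quantize(xmax, width),
        ]
        parts.append(" ".join(str(c) for c in coords) + " " + label)
    return " ".join(parts)


def detection_prompt(positive_labels, negative_labels):
    """Prefix the class labels (present and absent) with the word 'detect'."""
    return "detect " + " and ".join(positive_labels + negative_labels)


boxes = [(10, 20, 90, 100, "cat"), (20, 30, 100, 100, "dog")]
print(detection_prompt(["cat", "dog"], ["leopard"]))
print(detection_target(boxes, height=1000, width=1000))
```

For a 1000 by 1000 image the pixel coordinates map to themselves, so the example reproduces the target string quoted in the text.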

Dataset mixing ratio for pretraining. Table 9 provides the data mixing ratio for pretraining all PaLI variants.

A.3 Fine-Tuning details

Hyperparameters for finetuning the V&L tasks. We performed a limited hyperparameter search for finetuning. The number of training steps is mostly selected based on dataset size. The batch size is selected among {128, 256, 512}, and the initial learning rate among {1e-5, 3e-5, 1e-4}. The optimizer setting for finetuning is the same as the setting for pretraining. Note that we did not sweep over all possible combinations. Table 10 summarizes the hyperparameters corresponding to the main results.
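For reference, the full search space implied by the two candidate sets above is small. The sketch below merely enumerates it; the variable names are hypothetical, and, as noted in the text, not every combination was actually run.

```python
from itertools import product

batch_sizes = [128, 256, 512]
learning_rates = [1e-5, 3e-5, 1e-4]

# Every (batch size, initial learning rate) pair in the stated search space.
grid = [
    {"batch_size": batch_size, "learning_rate": lr}
    for batch_size, lr in product(batch_sizes, learning_rates)
]
print(len(grid), "candidate configurations")
```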

Appendix B WebLI Dataset Details

The WebLI dataset covers about 10 billion images and 12 billion alt-texts in 109 languages. We further apply a publicly available automatic service to extract OCR annotations on all images, producing an additional 29 billion image-OCR pairs. Examples and statistics for the WebLI corpus are shown in Figure 2.

Due to the scale of WebLI, to mitigate train-to-test leakage, we perform near de-duplication of the images against the train, validation, and test splits of 68 common vision/vision-language datasets. Eliminating these images from the WebLI dataset does not result in any significant shrinkage (0.36%), and avoids any potential “leakage” of examples from the pretraining setup to the downstream evaluation tasks.

To improve the data quality in terms of image-text alignment, we score image and alt-text pairs based on their cross-modal similarity. This score is measured with cosine similarity between embedding representations from each modality, computed as follows. The image embeddings are trained with a graph-based, semi-supervised representation learning approach, as described in Juan et al. (2019). Then, the text embeddings are learned using the frozen image embeddings, based on a contrastive approach using a Transformer encoder for the text, which forces both modality representations to the same embedding space.

We tune a threshold on the score of the image and alt-text pairs, and retain only the top-scoring 10% of the original WebLI image-text pairs (about 1B examples), which we use to train PaLI.
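The scoring and filtering steps described above can be sketched as follows, assuming the image and text embeddings have already been computed by the (here unspecified) encoders. The function names and the toy embeddings are hypothetical. The sketch also keeps a fixed fraction by rank rather than applying a tuned score threshold, which is a simplification.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_top_fraction(pairs, fraction=0.10):
    """pairs: list of (image_embedding, text_embedding, payload) tuples.

    Returns the payloads of the best-aligned pairs, highest score first.
    """
    scored = sorted(
        pairs, key=lambda p: cosine_similarity(p[0], p[1]), reverse=True
    )
    keep = max(1, int(len(scored) * fraction))
    return [payload for _, _, payload in scored[:keep]]


# Toy data: alignment gets worse as i grows, so "pair0" is the best match.
pairs = [([1.0, 0.0], [1.0, float(i)], f"pair{i}") for i in range(10)]
kept = keep_top_fraction(pairs, fraction=0.10)
print(kept)
```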

Appendix C Additional experimental results

In Table 11, we evaluate the performance of PaLI on a range of language understanding benchmarks, in order to verify that the language-only capabilities of the model have been preserved. More specifically, we compare mT5-XXL and PaLI-17B, evaluating on the English-only SuperGLUE benchmark (Wang et al., 2019a), and on three multilingual benchmarks from XTREME (Hu et al., 2020): XNLI (Conneau et al., 2018), which is a textual entailment task covering 14 languages, XQuAD (Artetxe et al., 2020) and TyDiQA-GoldP (Clark et al., 2020), which are both question-answering tasks covering 10 and 11 languages, respectively.

C.2 Additional scaling results

Figure 5 shows that model scaling significantly affects performance across multiple languages. We can see that PaLI-17B improves substantially over PaLI-3B across languages. We also include a plot where, for a subset of 600 examples, we back-translate the predictions from six languages (French, Hindi, Hebrew, Romanian, Thai and Chinese) to English and compute the CIDEr score against English references, for a better comparison to the English quality. The result shows that the captioning quality across languages is fairly consistent.

We also trained a 5B PaLI model consisting of mT5-Large and ViT-e for additional datapoints. We evaluated this 5B model on two representative captioning and VQA benchmarks, COCO-Cap and OKVQA, and the results are shown in Table 12. We note that the training mixture and hyperparameters of this PaLI-5B checkpoint are slightly different from other PaLI sizes, but the results are still indicative and supportive of our conclusions regarding the value of joint scaling.

On COCO, the improvement from PaLI-3B to 5B (+2.1 CIDEr points) is slightly smaller than the improvement from PaLI-15B to 17B (+2.8). On OKVQA, it is likely that the mT5-Large encoder-decoder cannot exploit the benefit of ViT-e as much as mT5-XXL can, since VQA tasks require stronger language-understanding capabilities than image captioning tasks. In general, it is clear that scaling ViT still has a much better return on investment (see the last column in Table 12), even for PaLI-5B, where the ViT model is much larger than the encoder-decoder backbone. Note that we computed RoI as “improvement per 1B parameters”, using COCO and OKVQA numbers as performance indicators.
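The "improvement per 1B parameters" measure can be made concrete with the two COCO deltas quoted above. The sketch assumes the nominal model sizes implied by the model names (3B, 5B, 15B, 17B) rather than exact parameter counts, and it uses only the COCO numbers, so it does not reproduce the RoI column of Table 12. The function name is hypothetical.

```python
def roi_per_billion(delta_score, params_before_b, params_after_b):
    """Score improvement per additional billion parameters."""
    return delta_score / (params_after_b - params_before_b)


# COCO CIDEr gains from the text, with nominal sizes in billions of parameters.
roi_3b_to_5b = roi_per_billion(2.1, 3, 5)
roi_15b_to_17b = roi_per_billion(2.8, 15, 17)
print(roi_3b_to_5b, roi_15b_to_17b)
```

Under these assumptions both steps add roughly 2B vision parameters, so the ratio of the two RoI values simply follows the ratio of the CIDEr gains.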

C.3 Additional ablations

Table 14 shows that initializing from unimodal checkpoints plays a critical role in PaLI’s quality. Table 15 shows that freezing ViT during pretraining leads to an improvement in downstream finetuning on COCO.

Table 16 shows the effect of the non-English part of the WebLI data. The table shows two sets of comparisons for the pretraining data: 1) using only the English subset of WebLI vs. using the whole of WebLI; 2) taking out the non-EN part of WebLI from the full mix vs. using the full mix. These comparisons are performed with a 1.5B version of the PaLI model, consisting of mT5-Large and ViT-L (with 300M parameters). This model has a similar parameter ratio (20% for ViT) to that of PaLI-17B (23%). Each model is pretrained to cover 200M of the data. All downstream benchmarks are fine-tuned and evaluated at 224 $\times$ 224 image resolution. The six non-EN languages (6L) for XM-3600 are fr, hi, iw, ro, th and zh, and "7L" for xGQA are en, bn, de, id, ko, pt, ru, zh; both are the same as those included in Table 2 and Table 4. The takeaways are as follows:

(comparison 1, row #1 vs row #2) With only the English portion of WebLI, the model’s multilingual captioning capability remains very low (as measured on XM-3600), even with further finetuning on COCO-35L. There is also a clear drop in cross-lingual VQA performance on xGQA.

(comparison 2, row #3 vs row #4) Taking away the multilingual part of WebLI from the full mixture, which still contains other translated multilingual/cross-lingual datasets (CC3M-35L, VQ2A-CC3M-35L, VQG-CC3M-35L), still has a significant impact on XM-3600 performance. On xGQA, because of the cross-lingual training source VQ2A-CC3M-35L, the impact of removing non-EN WebLI data is reduced but still apparent. With the non-EN WebLI data in the full mix, xGQA performance improves by +0.4 overall and, in every language, is at least as good as with only WebLI-EN.

Finally, there is an interesting result: when training with all the languages of WebLI, the model performs better on (English) COCO captions than when training with English-only WebLI (about +2 CIDEr points). This suggests that 1) the multilingual WebLI may contain extra images with richer objects and their descriptions compared with the English-only subset, and 2) the model may be able to exploit the shared linguistic structure across languages, benefiting from transfer learning across languages.

C.4 Evaluation of PaLI’s Visual Component: ViT-e

Table 17 compares the ViT-e architecture with the smaller ViT-G and ViT-g architectures on vision-only and vision-language tasks. The results suggest that V&L tasks could benefit more from scaling up the vision backbone, even on the high end. In Table 18, we fine-tune the pretrained ViT-e model on the ImageNet dataset, and then report the evaluation scores on several out-of-distribution test variants: ImageNet-v2, ObjectNet, and ReaL (Beyer et al., 2020). We follow the finetuning protocol of Zhai et al. (2022a), but use a $560\times 560$ resolution. We evaluate the fine-tuned model at $644\times 644$ (Touvron et al., 2019), a resolution chosen according to a held-out 2% of the training set; results are reported in Table 18. ViT-e achieves 90.9% top-1 accuracy on ImageNet and shows clear benefits on the OOD benchmarks.

Since ViT-e is new and has not been evaluated in the prior work, we evaluate its standalone performance. For this, we perform supervised fine-tuning on standard classification tasks. Additionally, we perform LiT transfer (Zhai et al., 2022b) to evaluate the frozen representation quality in a zero-shot setup.

We follow LiT (Zhai et al., 2022b) to add zero-shot transfer capabilities to the (frozen) ViT-e model, the visual component of PaLI. More specifically, we tune a text encoder while the ViT image encoder is kept frozen. We use the English subset of the WebLI dataset for the text encoder training, since all evaluation tasks in Table 19 are in English.
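Once a text encoder has been tuned against the frozen image encoder, zero-shot classification reduces to a nearest-neighbor lookup in the shared embedding space. The sketch below assumes the embeddings are already available as plain lists of floats; the encoders themselves, the function names, and the toy vectors are hypothetical and are not part of the LiT or PaLI code.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class whose text embedding is most similar to the image.

    class_text_embeddings: {class_name: embedding of a prompt for that class}.
    """
    image = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, normalize(text)))
        for name, text in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for the outputs of the image and text encoders.
prediction, scores = zero_shot_classify(
    [0.9, 0.1], {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
)
print(prediction, scores)
```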

These results highlight that going from ViT-g to ViT-e provides consistently better results. Notably, LiT with ViT-e achieves 84.9% zero-shot accuracy on the challenging out-of-distribution ObjectNet test set, setting a new state of the art. The VTAB-Natural benchmark (Zhai et al., 2019) consists of seven diverse natural image datasets, for which LiT also benefits from ViT-e over ViT-g. Detailed results on each VTAB-Natural task are in Appendix C.6.

We also test multilingual performance using WebLI in this setting. We further perform LiT transfer using the same multilingual WebLI dataset as used to train PaLI, and use Crossmodal-3600 to evaluate the cross-lingual image-text retrieval performance. Figure 6 shows that LiT ViT-e pretrained on the English subset substantially outperforms the same model pretrained on the multilingual dataset. The same observation applies to a few languages that are similar to English, e.g., Spanish (es), French (fr), Italian (it). However, the multilingual model performs much better on most other languages, especially those with a non-Latin script such as Chinese (zh), Japanese (ja), Korean (ko), and Hebrew (iw). On average (avg), the multilingual LiT ViT-e outperforms the English-only model by a large margin. More results can be found in Table 23. These results highlight the importance of having good multilingual benchmarks to measure the benefits of training models on diverse datasets such as WebLI.

C.5 Results on TextCaps, TextVQA and VizWiz-QA without Detected OCR as Input

In the main text, we presented results on TextCaps, TextVQA, VizWiz-Cap, VizWiz-QA and ST-VQA with detected OCR strings as input. Following Kil et al. (2022), we order the OCR items based on their locations in the image, from top left to bottom right. We only include the OCR strings themselves, without the OCR-item locations provided by the API. GIT2 (Wang et al., 2022a) has demonstrated strong performance without the OCR input, while PaLI-17B shows the benefit of leveraging a specialized OCR system as a better recipe for solving these tasks.
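The ordering of OCR items can be sketched as a simple sort over item positions. The input format (a dictionary with `text`, `x` and `y` for the top-left corner of each item) and the function name are hypothetical, and sorting by row and then by column is an assumption about how "top left to bottom right" is resolved; a real pipeline may first group items into text lines.

```python
def ocr_input_string(ocr_items):
    """Order OCR items from top left to bottom right and join their strings.

    Only the recognized text is kept; the locations are used for ordering
    and then discarded.
    """
    ordered = sorted(ocr_items, key=lambda item: (item["y"], item["x"]))
    return " ".join(item["text"] for item in ordered)


items = [
    {"text": "EXIT", "x": 300, "y": 10},
    {"text": "STOP", "x": 20, "y": 10},
    {"text": "ahead", "x": 50, "y": 200},
]
print(ocr_input_string(items))
```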

Table 20 shows the results on TextCaps, TextVQA and VizWiz-QA without the detected OCR strings as input. PaLI suffers slightly without OCR input, while its performance remains close to the first version of GIT. This result may suggest that the significantly larger vocabulary of PaLI adds further difficulty to OCR string generation.

However, for VizWiz-QA, PaLI establishes SOTA performance without OCR input.

C.6 Detailed VTAB Results

For the VTAB benchmark (Zhai et al., 2019), we follow the methodology outlined in Zhai et al. (2022b). PaLI sets a new state-of-the-art zero-shot performance for the “natural” subset (see Table 21).

C.7 Top 5 Accuracy on Zero-shot ImageNet Datasets

Table 22 shows the top-5 accuracy for zero-shot evaluation on the ImageNet datasets.

C.8 More Zero-shot Image-text Retrieval Results on Crossmodal-3600

Table 23 shows more zero-shot image-text retrieval results on Crossmodal-3600.

Appendix D Model Fairness, Biases, and Other Potential Issues

Models trained on web data are at risk of being biased or unfair due to biases in that data. A first step towards addressing those risks is being transparent about their existence, and then measuring them. To this end, we add a data card (Pushkarna et al., 2022) for WebLI and a model card (Mitchell et al., 2019) for PaLI in Appendix G and Appendix F, respectively.

To understand the demographic properties of the data, we randomly sample 112,782 examples (0.001% of the full data set; we sample only a subset because of the limitations of the labeling tool, described next) and analyze both the images and the texts of the sampled data with the Know Your Data (KYD) tool. We use KYD to analyze the perceived gender presentation of image subjects (Schumann et al., 2021) along with gender expressed through pronouns in text. In the sampled images, 54% of people appear feminine presenting and 46% masculine presenting. In the sampled text, female pronouns (e.g., she, her) are used 30% of the time, male pronouns (e.g., he, him) 38% of the time, and they or them (either singular or plural) 31% of the time. We also analyze the perceived age of individuals appearing in the sampled images, resulting in the distribution displayed in Figure 7.

We consider all the effort above a first step, and know that it will be important to continue to measure and mitigate bias as we apply our model to new tasks. Deeper analysis will include the study of the model’s recognition capabilities and potential biases observed towards specific attributes, e.g. related to gender, age, etc. and how scaling affects these observations.

Appendix E Limitations

Despite good performance, our model has a number of limitations. For example, the model might not describe a complex scene with many objects very thoroughly, because most of the source data does not have complex annotations. We have tried to mitigate this with the object-aware and localization-aware queries added to the data.

We also noticed that some of the multilingual capabilities are lost when fine-tuned on English-only data, which is consistent with other model fine-tuning behavior. Ideally these models should be fine-tuned on a mix of multiple datasets including multilingual ones.

There are limitations related to the evaluation procedures of the benchmarks. Since we are evaluating in the open-vocabulary generative setting, for example in VQA, the model might generate a correct response which is a synonym or a paraphrase of the target response and does not match the target exactly. In these cases the answer is counted as incorrect. Fixed-vocabulary approaches do not suffer from these issues, but are limited in generalization beyond the answers of a specific dataset. Further, in terms of evaluation, some benchmarks might need more comprehensive strategies to avoid evaluations with Western-centric bias. Multilingual models and benchmarks are a first step in that direction.
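The effect of exact matching on open-vocabulary answers can be seen in a minimal sketch. The normalization used here (lowercasing and stripping whitespace) and the function name are assumptions for illustration; individual VQA benchmarks define their own official metrics, which may differ.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal the target after light normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    matches = sum(
        pred.strip().lower() == target.strip().lower()
        for pred, target in zip(predictions, targets)
    )
    return matches / len(targets)


# "couch" is a reasonable answer for "sofa", but it is counted as incorrect.
accuracy = exact_match_accuracy(["couch", "Two"], ["sofa", "two"])
print(accuracy)
```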

Appendix F PaLI Model Card

Following Mitchell et al. (2019), we present the PaLI model card in Table LABEL:tab:modelcard.

Appendix G WebLI Datasheet

Following Gebru et al. (2021), we present the WebLI datasheet in Table LABEL:table:datasheet.