Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Introduction
One key aspect of intelligence is the ability to quickly learn to perform a new task given a short instruction. While initial progress has been made towards a similar capability in computer vision, the most widely used paradigm still consists of first pretraining on a large amount of supervised data, before fine-tuning the model on the task of interest. However, successful fine-tuning often requires many thousands of annotated data points. In addition, it often requires careful per-task hyperparameter tuning and is also resource intensive. Recently, multimodal vision-language models trained with a contrastive objective have enabled zero-shot adaptation to novel tasks, without the need for fine-tuning. However, because these models simply provide a similarity score between a text and an image, they can only address limited use cases such as classification, where a finite set of outcomes is provided beforehand. They crucially lack the ability to generate language, which makes them less suitable for more open-ended tasks such as captioning or visual question-answering. Others have explored visually-conditioned language generation but have not yet shown good performance in low-data regimes.
We introduce Flamingo, a Visual Language Model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended vision and language tasks, simply by being prompted with a few input/output examples, as illustrated in Figure 1. Of the 16 tasks we consider, Flamingo also surpasses the fine-tuned state of the art on 6 tasks, despite using orders of magnitude less task-specific training data (see Figure 2). To achieve this, Flamingo takes inspiration from recent work on large language models (LMs) which are good few-shot learners. A single large LM can achieve strong performance on many tasks using only its text interface: a few examples of a task are provided to the model as a prompt, along with a query input, and the model generates a continuation to produce a predicted output for that query. We show that the same can be done for image and video understanding tasks such as classification, captioning, or question-answering: these can be cast as text prediction problems with visual input conditioning. The difference from an LM is that the model must be able to ingest a multimodal prompt containing images and/or videos interleaved with text. Flamingo models have this capability: they are visually-conditioned autoregressive text generation models able to ingest a sequence of text tokens interleaved with images and/or videos, and produce text as output. Flamingo models leverage two complementary pre-trained and frozen models: a vision model which can “perceive” visual scenes and a large LM which performs a basic form of reasoning. Novel architecture components are added in between these models to connect them in a way that preserves the knowledge they have accumulated during computationally intensive pre-training.
Flamingo models are also able to ingest high-resolution images or videos thanks to a Perceiver-based architecture that can produce a small fixed number of visual tokens per image/video, given a large and variable number of visual input features.
A crucial aspect for the performance of large LMs is that they are trained on a large amount of text data. This training provides general-purpose generation capabilities that allow these LMs to perform well when prompted with task examples. Similarly, we demonstrate that the way we train the Flamingo models is crucial for their final performance. They are trained on a carefully chosen mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. After this training, a Flamingo model can be directly adapted to vision tasks via simple few-shot learning without any task-specific tuning.
Contributions. In summary, our contributions are the following: (i) We introduce the Flamingo family of VLMs which can perform various multimodal tasks (such as captioning, visual dialogue, or visual question-answering) from only a few input/output examples. Thanks to architectural innovations, the Flamingo models can efficiently accept arbitrarily interleaved visual data and text as input and generate text in an open-ended manner. (ii) We quantitatively evaluate how Flamingo models can be adapted to various tasks via few-shot learning. We notably reserve a large set of held-out benchmarks which have not been used for validation of any design decisions or hyperparameters of the approach. We use these to estimate unbiased few-shot performance. (iii) Flamingo sets a new state of the art in few-shot learning on a wide array of 16 multimodal language and image/video understanding tasks. On 6 of these 16 tasks, Flamingo also outperforms the fine-tuned state of the art despite using only 32 task-specific examples, around 1000 times less task-specific training data than the current state of the art. With a larger annotation budget, Flamingo can also be effectively fine-tuned to set a new state of the art on five additional challenging benchmarks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes.
Approach
This section describes Flamingo: a visual language model that accepts text interleaved with images/videos as input and outputs free-form text. The key architectural components shown in Figure 3 are chosen to leverage pretrained vision and language models and bridge them effectively. First, the Perceiver Resampler (Section 2.1) receives spatio-temporal features from the Vision Encoder (obtained from either an image or a video) and outputs a fixed number of visual tokens. Second, these visual tokens are used to condition the frozen LM using freshly initialised cross-attention layers (Section 2.2) that are interleaved between the pretrained LM layers. These new layers offer an expressive way for the LM to incorporate visual information for the next-token prediction task. Flamingo models the likelihood of text $y$ conditioned on interleaved images and videos $x$ as follows:
$$p(y \mid x) = \prod_{\ell=1}^{L} p(y_\ell \mid y_{<\ell}, x_{\leq \ell}), \tag{1}$$
where $y_\ell$ is the $\ell$-th language token of the input text, $y_{<\ell}$ is the set of preceding tokens, and $x_{\leq \ell}$ is the set of images/videos preceding token $y_\ell$ in the interleaved sequence.
Vision Encoder: from pixels to features. Our vision encoder is a pretrained and frozen Normalizer-Free ResNet (NFNet); we use the F6 model. We pretrain the vision encoder using a contrastive objective on our datasets of image and text pairs, using the two-term contrastive loss from Radford et al. We use the output of the final stage, a 2D spatial grid of features that is flattened to a 1D sequence. For video inputs, frames are sampled at 1 FPS and encoded independently to obtain a 3D spatio-temporal grid of features to which learned temporal embeddings are added. Features are then flattened to 1D before being fed to the Perceiver Resampler. More details on the contrastive model training and performance are given in Appendix B.1.3 and Appendix B.3.2, respectively.
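The video path described above (sample frames at a fixed rate, add one temporal embedding per frame, flatten to a 1D sequence) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy feature values, and the toy embeddings are all assumptions made for the example.

```python
def sample_times(duration_seconds, fps=1.0):
    """Timestamps (in seconds) of frames sampled at a fixed rate."""
    count = int(duration_seconds * fps)
    return [i / fps for i in range(count)]


def add_temporal_embeddings_and_flatten(frame_features, temporal_embeddings):
    """frame_features[t][s] is the feature vector at frame t, spatial position s.

    Adds temporal_embeddings[t] to every spatial position of frame t, then
    flattens the (time, space) grid into one 1D sequence of vectors.
    """
    flat = []
    for t, frame in enumerate(frame_features):
        for vector in frame:
            flat.append([v + e for v, e in zip(vector, temporal_embeddings[t])])
    return flat


# Toy 3-second video: 3 frames at 1 FPS, 2 spatial positions per frame, dimension 2.
times = sample_times(3)
features = [[[float(t), 0.0], [float(t), 1.0]] for t in range(len(times))]
embeddings = [[10.0 * t, 0.0] for t in range(len(times))]  # stand-in for learned embeddings
tokens = add_temporal_embeddings_and_flatten(features, embeddings)
print(len(tokens), tokens[-1])  # prints: 6 [22.0, 1.0]
```

The length of the flattened sequence grows with the number of frames, which is why a later module is needed to reduce it to a fixed size.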
Perceiver Resampler: from varying-size large feature maps to few visual tokens. This module connects the vision encoder to the frozen language model as shown in Figure 3. It takes as input a variable number of image or video features from the vision encoder and produces a fixed number of visual outputs (64), reducing the computational complexity of the vision-text cross-attention. Similar to Perceiver and DETR, we learn a predefined number of latent input queries which are fed to a Transformer and cross-attend to the visual features. We show in our ablation studies (Section 3.3) that using such a vision-language resampler module outperforms a plain Transformer and an MLP. We provide an illustration, more architectural details, and pseudo-code in Appendix A.1.1.
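The key property of the resampler, a fixed number of outputs for any number of input features, can be illustrated with a heavily simplified sketch. The code below assumes a single attention layer, a single head, no learned projections, and random vectors standing in for learned latent queries; the actual module (Appendix A.1.1) is a multi-layer Transformer, so this shows only the shape behaviour, not the real architecture.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head dot-product cross-attention without learned projections.

    queries:  list of R latent vectors, each of dimension d
    features: list of N visual feature vectors, each of dimension d
    Returns R output vectors: one per latent query, whatever N is.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in features]
        weights = softmax(scores)
        attended = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
        # Residual connection: each latent is updated by what it attended to.
        outputs.append([qj + aj for qj, aj in zip(q, attended)])
    return outputs


def make_vectors(n, d, rng):
    """Random vectors used as stand-ins for learned latents and encoder features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
DIM = 8
NUM_LATENTS = 64  # fixed number of visual outputs stated in the text
latents = make_vectors(NUM_LATENTS, DIM, rng)

image_features = make_vectors(100, DIM, rng)  # e.g. a flattened 10x10 grid
video_features = make_vectors(800, DIM, rng)  # e.g. 8 frames of 10x10 each

image_tokens = cross_attend(latents, image_features)
video_tokens = cross_attend(latents, video_features)
print(len(image_tokens), len(video_tokens))  # prints: 64 64
```

Because the number of queries is fixed, the cost of the later vision-text cross-attention in the LM does not depend on image resolution or video length.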
2.2 Conditioning frozen language models on visual representations
Text generation is performed by a Transformer decoder, conditioned on the visual representations produced by the Perceiver Resampler. We interleave pretrained and frozen text-only LM blocks with blocks trained from scratch that cross-attend to the visual output from the Perceiver Resampler.
Interleaving new gated xattn-dense layers within a frozen pretrained LM. We freeze the pretrained LM blocks, and insert gated cross-attention dense blocks (Figure 4) between the original layers, trained from scratch. To ensure that at initialization, the conditioned model yields the same results as the original language model, we use a $\tanh$-gating mechanism. This multiplies the output of a newly added layer by $\tanh(\alpha)$ before adding it to the input representation from the residual connection, where $\alpha$ is a layer-specific learnable scalar initialized to $0$. Thus, at initialization, the model output matches that of the pretrained LM, improving training stability and final performance. In our ablation studies (Section 3.3), we compare the proposed gated xattn-dense layers against recent alternatives and explore the effect of how frequently these additional layers are inserted to trade off between efficiency and expressivity. See Appendix A.1.2 for more details.
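The effect of a zero-initialized tanh gate on a residual branch can be checked with a few lines of Python. This is a sketch under stated assumptions: `toy_layer` is a hypothetical stand-in for a freshly initialised cross-attention or dense layer, and the gate scalar is called `alpha` here.

```python
import math


def gated_residual(x, new_layer, alpha):
    """Return x + tanh(alpha) * new_layer(x), applied elementwise."""
    gate = math.tanh(alpha)
    update = new_layer(x)
    return [xi + gate * ui for xi, ui in zip(x, update)]


def toy_layer(x):
    """Hypothetical stand-in for a newly added layer with arbitrary weights."""
    return [3.0 * xi + 1.0 for xi in x]


hidden = [0.5, -1.0, 2.0]
at_init = gated_residual(hidden, toy_layer, alpha=0.0)
after_training = gated_residual(hidden, toy_layer, alpha=0.5)
print(at_init == hidden)  # prints: True
```

With the gate at zero the new branch contributes nothing, so the frozen LM's activations pass through unchanged; as the gate moves away from zero during training, visual information is blended in gradually.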
Varying model sizes. We perform experiments across three model sizes, building on the 1.4B, 7B, and 70B parameter Chinchilla models, and call them Flamingo-3B, Flamingo-9B and Flamingo-80B, respectively. For brevity, we refer to the last as Flamingo throughout the paper. While increasing the parameter count of the frozen LM and the trainable vision-text gated xattn-dense modules, we maintain a fixed-size frozen vision encoder and trainable Perceiver Resampler across the different models (small relative to the full model size). See Appendix B.1.1 for further details.
2.3 Multi-visual input support: per-image/video attention masking
The image-causal modelling introduced in Equation (1) is obtained by masking the full text-to-image cross-attention matrix, limiting which visual tokens the model sees at each text token. At a given text token, the model attends to the visual tokens of the image that appeared just before it in the interleaved sequence, rather than to all previous images (formalized and illustrated in Appendix A.1.3). Though the model only directly attends to a single image at a time, the dependency on all previous images remains via self-attention in the LM. This single-image cross-attention scheme importantly allows the model to seamlessly generalise to any number of visual inputs, regardless of how many are used during training. In particular, we use only up to 5 images per sequence when training on our interleaved datasets, yet our model is able to benefit from sequences of up to 32 pairs (or “shots”) of images/videos and corresponding texts during evaluation. We show in Section 3.3 that this scheme is more effective than allowing the model to cross-attend to all previous images directly.
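The single-image cross-attention scheme can be sketched as a boolean mask with one row per text token and one column per image. The sequence representation below (a list of tagged tuples) is an assumption made for illustration; the paper's own formalization is in Appendix A.1.3.

```python
def build_cross_attention_mask(sequence):
    """Return (text_tokens, mask) for an interleaved sequence.

    sequence: list of ("image", name) or ("text", token) items in reading order.
    mask[t][i] is True when text token t may cross-attend to image i, i.e.
    image i is the most recent image that appeared before token t.
    """
    num_images = sum(1 for kind, _ in sequence if kind == "image")
    text_tokens, mask = [], []
    last_image = -1  # index of the most recent image seen so far; -1 means none yet
    for kind, value in sequence:
        if kind == "image":
            last_image += 1
        else:
            text_tokens.append(value)
            mask.append([i == last_image for i in range(num_images)])
    return text_tokens, mask


sequence = [
    ("image", "img0"), ("text", "A"), ("text", "cat"),
    ("image", "img1"), ("text", "A"), ("text", "dog"),
]
tokens, mask = build_cross_attention_mask(sequence)
for token, row in zip(tokens, mask):
    print(token, row)
```

Each row has at most one True entry, so the mask has the same form for 5 images or 32; this is one way to see why the scheme does not tie the model to the number of images used in training. In this sketch, text that precedes every image gets an all-False row.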
2.4 Training on a mixture of vision and language datasets
We train the Flamingo models on a mixture of three kinds of datasets, all scraped from the web: an interleaved image and text dataset derived from webpages, image-text pairs, and video-text pairs.
M3W: Interleaved image and text dataset. The few-shot capabilities of Flamingo models rely on training on interleaved text and image data. For this purpose, we collect the MultiModal MassiveWeb (M3W) dataset. We extract both text and images from the HTML of approximately 43 million webpages, determining the positions of images relative to the text based on the relative positions of the text and image elements in the Document Object Model (DOM). An example is then constructed by inserting
Pairs of image/video and text. For our image and text pairs we first leverage the ALIGN dataset, composed of 1.8 billion images paired with alt-text. To complement this dataset, we collect our own dataset of image and text pairs targeting better quality and longer descriptions: LTIP (Long Text & Image Pairs) which consists of 312 million image and text pairs. We also collect a similar dataset but with videos instead of still images: VTP (Video & Text Pairs) consists of 27 million short videos (approximately 22 seconds on average) paired with sentence descriptions. We align the syntax of paired datasets with the syntax of M3W by prepending
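Multi-objective training and optimisation strategy. We train our models by minimizing a weighted sum of per-dataset expected negative log-likelihoods of text, given the visual inputs:
$$\sum_{m=1}^{M} \lambda_m \cdot \mathbb{E}_{(x, y) \sim \mathcal{D}_m} \left[ -\sum_{\ell=1}^{L} \log p(y_\ell \mid y_{<\ell}, x_{\leq \ell}) \right], \tag{2}$$
where $\mathcal{D}_m$ and $\lambda_m$ are the $m$-th dataset and its weighting, respectively, and the per-token likelihood is the one defined in Equation (1). Tuning the per-dataset weights $\lambda_m$ is key to performance. We accumulate gradients over all datasets, which we found outperforms a “round-robin” approach. We provide further training details and ablations in Appendix B.1.2.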
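The weighted objective can be made concrete with a small numeric sketch. The per-example negative log-likelihood values and the weights below are hypothetical, chosen only so the arithmetic is easy to follow; the batch mean is used as an estimate of each per-dataset expectation.

```python
def weighted_multi_dataset_loss(per_dataset_nll, weights):
    """Weighted sum of per-dataset mean negative log-likelihoods.

    per_dataset_nll: dict mapping dataset name -> list of per-example NLL values
    weights:         dict mapping dataset name -> non-negative weight
    """
    total = 0.0
    for name, nll_values in per_dataset_nll.items():
        expected_nll = sum(nll_values) / len(nll_values)  # batch estimate of the expectation
        total += weights[name] * expected_nll
    return total


# Hypothetical NLL values and weights; these are not values from the paper.
batch_nll = {"M3W": [2.0, 4.0], "image_text_pairs": [1.0, 3.0], "video_text_pairs": [6.0]}
weights = {"M3W": 1.0, "image_text_pairs": 0.5, "video_text_pairs": 0.25}
loss = weighted_multi_dataset_loss(batch_nll, weights)
print(loss)  # prints: 5.5
```

Taking one gradient step on this combined loss corresponds to accumulating gradients over all datasets before each update, as opposed to a round-robin schedule that would step on one dataset's term at a time.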
2.5 Task adaptation with few-shot in-context learning
Once Flamingo is trained, we use it to tackle a visual task by conditioning it on a multimodal interleaved prompt. We evaluate the ability of our models to rapidly adapt to new tasks using in-context learning, analogously to GPT-3, by interleaving support example pairs in the form of (image, text) or (video, text), followed by the query visual input, to build a prompt (details in Appendix A.2). We perform open-ended evaluations using beam search for decoding, and close-ended evaluations using our model’s log-likelihood to score each possible answer. We explore zero-shot generalization by prompting the model with two text-only examples from the task, with no corresponding images. Evaluation hyperparameters and additional details are given in Appendix B.1.5.
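Prompt construction and close-ended scoring can be sketched as below. Everything named here is an assumption for illustration: the tuple-based prompt representation, the file names, the "Output:" text template, and in particular `toy_log_likelihood`, which is a hypothetical scorer standing in for the model and has nothing to do with how Flamingo computes likelihoods.

```python
def build_prompt(support_pairs, query_visual):
    """Interleave (visual, text) support pairs, then append the query visual input."""
    prompt = []
    for visual, text in support_pairs:
        prompt.append(("visual", visual))
        prompt.append(("text", text))
    prompt.append(("visual", query_visual))
    return prompt


def pick_answer(prompt, candidates, log_likelihood):
    """Close-ended evaluation: return the candidate with the highest log-likelihood."""
    return max(candidates, key=lambda answer: log_likelihood(prompt, answer))


def toy_log_likelihood(prompt, answer):
    """Hypothetical scorer standing in for the model: favours answers seen in the prompt."""
    seen = sum(1 for kind, value in prompt if kind == "text" and answer in value)
    return float(seen) - 0.1 * len(answer)


support = [
    ("cat.jpg", "Output: a cat"),
    ("dog.jpg", "Output: a dog"),
    ("cat2.jpg", "Output: a cat"),
]
prompt = build_prompt(support, "query.jpg")
print(len(prompt))  # prints: 7
print(pick_answer(prompt, ["a cat", "a dog", "a bird"], toy_log_likelihood))  # prints: a cat
```

Open-ended evaluation would replace `pick_answer` with a decoding procedure such as beam search over the model's next-token distribution, which is not sketched here.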
Experiments
Our goal is to develop models that can rapidly adapt to diverse and challenging tasks. For this, we consider a wide array of 16 popular multimodal image/video and language benchmarks. In order to validate model design decisions during the course of the project, 5 of these benchmarks were used as part of our development (dev) set: COCO, OKVQA, VQAv2, MSVDQA and VATEX. Performance estimates on the dev benchmarks may be biased, as a result of model selection. We note that this is also the case for prior work which makes use of similar benchmarks to validate and ablate design decisions. To account for this, we report performance on an additional set of 11 benchmarks, spanning captioning, video question-answering, as well as some less commonly explored capabilities such as visual dialogue and multi-choice question-answering tasks. The evaluation benchmarks are described in Appendix B.1.4. We keep all evaluation hyperparameters fixed across all benchmarks. Depending on the task, we use four few-shot prompt templates we describe in more detail in Appendix B.1.5. We emphasize that we do not validate any design decisions on these 11 benchmarks and use them solely to estimate unbiased few-shot learning performance of our models.
Concretely, estimating few-shot learning performance of a model involves prompting it with a set of support samples and evaluating it on a set of query samples. For the dev benchmarks that are used both to validate design decisions and hyperparameters, as well as to report final performance, we therefore use four subsets: validation support, validation query, test support and test query. For other benchmarks, we need only the latter two. We report in Appendix B.1.4 how we form these subsets.
We report the results of the Flamingo models on few-shot learning in Section 3.1. Section 3.2 gives Flamingo fine-tuned results. An ablation study is given in Section 3.3. Appendix B.2 provides more results including Flamingo’s performance on the ImageNet and Kinetics700 classification tasks, and on our contrastive model’s performance. Appendix C includes additional qualitative results.
Few-shot results. Results are given in Table 1. Flamingo outperforms by a large margin all previous zero-shot or few-shot methods on the 16 benchmarks considered. This is achieved with as few as four examples per task, demonstrating practical and efficient adaptation of vision models to new tasks. More importantly, Flamingo is often competitive with state-of-the-art methods additionally fine-tuned on up to hundreds of thousands of annotated examples. On six tasks, Flamingo even outperforms the fine-tuned SotA despite using a single set of model weights and only 32 task-specific examples. Finally, despite having only used the dev benchmarks for design decisions, our results generalize well to the other benchmarks, confirming the generality of our approach.
Scaling with respect to parameters and shots. As shown in Figure 2, the larger the model, the better the few-shot performance, similar to GPT-3. The performance also improves with the number of shots. We further find that the largest model better exploits larger numbers of shots. Interestingly, even though our Flamingo models were trained with sequences limited to only 5 images on M3W, they are still able to benefit from up to 32 images or videos during inference. This demonstrates the flexibility of the Flamingo architecture for processing a variable number of videos or images.
3.2 Fine-tuning Flamingo as a pretrained vision-language model
While not the main focus of our work, we verify that when given more data, Flamingo models can be adapted to a task by fine-tuning their weights. In Table 2, we explore fine-tuning our largest model, Flamingo, for a given task with no limit on the annotation budget. In short, we fine-tune the model on a short schedule with a small learning rate, additionally unfreezing the vision backbone to accommodate a higher input resolution (details in Appendix B.2.2). This improves on our in-context few-shot learning results and sets a new state of the art on five additional tasks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes.
3.3 Ablation studies
In Table 3, we report our ablation results using Flamingo-3B on the validation subsets of the five dev benchmarks with 4 shots. Note that we use smaller batch sizes and a shorter training schedule compared to the final models. The Overall score is obtained by dividing each benchmark score by its state-of-the-art (SotA) performance from Table 1 and averaging the results. More details and results are given in Appendix B.3 and Table 10.
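The Overall score described above is a mean of SotA-normalised benchmark scores. A minimal sketch, using hypothetical benchmark names and numbers that are not results from the paper:

```python
def overall_score(scores, sota):
    """Average of per-benchmark scores, each normalised by that benchmark's SotA."""
    ratios = [scores[name] / sota[name] for name in scores]
    return sum(ratios) / len(ratios)


# Hypothetical numbers for illustration only; they are not results from the paper.
ablation_scores = {"bench_a": 50.0, "bench_b": 30.0}
sota_scores = {"bench_a": 100.0, "bench_b": 40.0}
print(overall_score(ablation_scores, sota_scores))  # prints: 0.625
```

Normalising before averaging keeps benchmarks with large raw metric values (for example captioning scores) from dominating those with smaller ones (for example accuracies).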
Importance of the training data mixture. As shown in row (i), getting the right training data plays a crucial role. In fact, removing the interleaved image-text dataset M3W leads to a decrease of more than in performance while removing the conventional paired image-text pairs also decreases performance (by ), demonstrating the need for different types of datasets. Moreover, removing our paired video-text dataset negatively affects performance on all video tasks. We ablate replacing our image-text pairs (ITP) by the publicly available LAION-400M dataset, which leads to a slight degradation in performance. We show in row (ii) the importance of our gradient accumulation strategy compared to using round-robin updates.
Visual conditioning of the frozen LM. We ablate the use of the 0-initialized tanh gating when merging the cross-attention output to the frozen LM output in row (iii). Without it, we see a drop of in our overall score. Moreover, we have noticed that disabling the 0-initialized tanh gating leads to training instabilities. Next, we ablate different conditioning architectures in row (iv). Vanilla xattn refers to the vanilla cross-attention from the original Transformer decoder. In the grafting approach, the frozen LM is used as is with no additional layers inserted, and a stack of interleaved self-attention and cross-attention layers that take the frozen LM output is learnt from scratch. Overall, we show that our gated xattn-dense conditioning approach works best.
Compute/Memory vs. performance trade-offs. In row (v), we ablate the frequency at which we add new gated xattn-dense blocks. Although adding them at every layer is better, it significantly increases the number of trainable parameters and time complexity of the model. Notably, inserting them every fourth block accelerates training by while only decreasing the overall score by . In light of this trade-off, we maximize the number of added layers under hardware constraints and add a gated xattn-dense block every fourth layer for Flamingo-9B and every seventh for Flamingo-80B. We further compare in row (vi) the Perceiver Resampler to an MLP and a vanilla Transformer given a parameter budget. Both underperform the Perceiver Resampler while also being slower.
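How the insertion frequency changes the number of trainable blocks can be sketched as follows. The placement rule (one new block before every `every`-th frozen layer, starting at the first) and the 8-layer toy LM are assumptions for illustration; the exact placement used by the authors is not specified in this section.

```python
def build_layer_stack(num_lm_layers, every):
    """List block names, inserting one new block before every `every`-th frozen LM layer."""
    stack = []
    for index in range(num_lm_layers):
        if index % every == 0:
            stack.append("gated_xattn_dense (trainable)")
        stack.append(f"lm_layer_{index} (frozen)")
    return stack


def count_new_blocks(num_lm_layers, every):
    """Number of trainable gated xattn-dense blocks in the resulting stack."""
    stack = build_layer_stack(num_lm_layers, every)
    return sum(1 for name in stack if name.startswith("gated"))


# A hypothetical 8-layer LM, used only to keep the numbers small.
print(count_new_blocks(8, every=1), count_new_blocks(8, every=4))  # prints: 8 2
```

Under this rule the count of new blocks, and hence the added trainable parameters and cross-attention compute, shrinks roughly in proportion to the insertion interval.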
Vision encoder. In row (vii), we compare our NFNet-F6 vision encoder pretrained with contrastive learning (details in Appendix B.1.3) to the publicly available CLIP ViT-L/14 model trained at 224 resolution. Our NFNet-F6 has a advantage over the CLIP ViT-L/14 and over a smaller NFNet-F0 encoder, which highlights the importance of using a strong vision backbone.
Freezing LM components prevents catastrophic forgetting. We verify the importance of freezing the LM layers at training in row (viii). If trained from scratch, we observe a large performance decrease of . Interestingly, fine-tuning our pretrained LM also leads to a drop in performance of . This indicates an instance of “catastrophic forgetting”, in which the model progressively forgets its pretraining while training on a new objective. In our setting, freezing the language model is a better alternative to training with the pre-training dataset (MassiveText) in the mixture.
Related work
Language modelling and few-shot adaptation. Language modelling has recently made substantial progress following the introduction of Transformers. The paradigm of first pretraining on a vast amount of data followed by an adaptation on a downstream task has become standard. In this work, we build on the 70B Chinchilla language model as the base LM for Flamingo. Numerous works have explored techniques to adapt language models to novel tasks using a few examples. These include adding small adapter modules, fine-tuning a small part of the LM, showing in-context examples in the prompt, or optimizing the prompt through gradient descent. In this paper, we take inspiration from the in-context few-shot learning technique instead of more involved few-shot learning approaches based on metric learning or meta-learning.
When language meets vision. These LM breakthroughs have been influential for vision-language modelling. In particular, BERT inspired a large body of vision-language work. We differ from these approaches as Flamingo models do not require fine-tuning on new tasks. Another family of vision-language models is based on contrastive learning. Flamingo differs from contrastive models as it can generate text, although we build and rely upon them for our vision encoder. Similar to our work are VLMs able to generate text in an autoregressive manner. Concurrent works also propose to formulate numerous vision tasks as text generation problems. Building on top of powerful pretrained language models has been explored in several recent works. One recent line of work proposes to freeze the pretrained LM weights to prevent catastrophic forgetting. We follow this idea by freezing the Chinchilla LM layers and adding learnable layers within the frozen LM. We differ from prior work by introducing the first LM that can ingest arbitrarily interleaved images, videos, and text.
Web-scale vision and language training datasets. Manually annotated vision and language datasets are costly to obtain and thus relatively small (10k-100k) in scale. To alleviate this lack of data, numerous works automatically scrape readily available paired vision-text data. In addition to such paired data, we show the importance of also training on entire multimodal webpages containing interleaved images and text as a single sequence. Concurrent work CM3 proposes to generate HTML markup from pages, while we simplify the text prediction task by only generating plain text. We emphasize few-shot learning and vision tasks while CM3 primarily evaluates on language-only benchmarks in a zero-shot or fine-tuned setup.
Discussion
Limitations. First, our models build on pretrained LMs, and as a side effect, directly inherit their weaknesses. For example, LM priors are generally helpful, but may play a role in occasional hallucinations and ungrounded guesses. Furthermore, LMs generalise poorly to sequences longer than the training ones. They also suffer from poor sample efficiency during training. Addressing these issues can accelerate progress in the field and enhance the abilities of VLMs like Flamingo.
Second, the classification performance of Flamingo lags behind that of state-of-the-art contrastive models. These models directly optimize for text-image retrieval, of which classification is a special case. In contrast, our models handle a wider range of tasks, such as open-ended ones. A unified approach to achieve the best of both worlds is an important research direction.
Third, in-context learning has significant advantages over gradient-based few-shot learning methods, but also suffers from drawbacks depending on the characteristics of the application at hand. We demonstrate the effectiveness of in-context learning when access is limited to only a few dozen examples. In-context learning also enables simple deployment, requiring only inference, generally with no hyperparameter tuning needed. However, in-context learning is known to be highly sensitive to various aspects of the demonstrations, and its inference compute cost and absolute performance scale poorly with the number of shots beyond this low-data regime. There may be opportunities to combine few-shot learning methods to leverage their complementary benefits. We discuss the limitations of our work in more depth in Appendix D.1.
Societal impacts. In terms of societal impacts, Flamingo offers a number of benefits while carrying some risks. Its ability to rapidly adapt to a broad range of tasks has the potential to enable non-expert users to obtain good performance in data-starved regimes, lowering the barriers to both beneficial and malicious applications. Flamingo is exposed to the same risks as large language models, such as outputting offensive language, propagating social biases and stereotypes, as well as leaking private information. Its ability to additionally handle visual inputs poses specific risks such as gender and racial biases relating to the contents of the input images, similar to a number of visual recognition systems. We refer the reader to Appendix D.2 for a more extensive discussion of the societal impacts of our work, both positive and negative, as well as mitigation strategies and early investigations of risks relating to racial or gender bias and toxic outputs. Finally, we note that, following prior work focusing on language models, the few-shot capabilities of Flamingo could be useful for mitigating such risks.
Conclusion. We proposed Flamingo, a general-purpose family of models that can be applied to image and video tasks with minimal task-specific training data. We also qualitatively explored interactive abilities of Flamingo such as “chatting” with the model, demonstrating flexibility beyond traditional vision benchmarks. Our results suggest that connecting pre-trained large language models with powerful visual models is an important step towards general-purpose visual understanding.
This research was funded by DeepMind. We would like to thank many colleagues for useful discussions, suggestions, feedback, and advice, including: Samuel Albanie, Relja Arandjelović, Kareem Ayoub, Lorrayne Bennett, Adria Recasens Continente, Tom Eccles, Nando de Freitas, Sander Dieleman, Conor Durkan, Aleksa Gordić, Raia Hadsell, Will Hawkins, Lisa Anne Hendricks, Felix Hill, Jordan Hoffmann, Geoffrey Irving, Drew Jaegle, Koray Kavukcuoglu, Agustin Dal Lago, Mateusz Malinowski, Soňa Mokrá, Gaby Pearl, Toby Pohlen, Jack Rae, Laurent Sifre, Francis Song, Maria Tsimpoukelli, Gregory Wayne, and Boxi Wu.
References
Checklist
Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
Did you describe the limitations of your work? [Yes] See Section 5.
Did you discuss any potential negative societal impacts of your work? [Yes] See Section 5 for a brief discussion and Appendix D.2 for the full discussion.
Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
If you are including theoretical results…
Did you state the full set of assumptions of all theoretical results? [N/A]
Did you include complete proofs of all theoretical results? [N/A]
Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The code and the data are proprietary.
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 3 and Appendix B.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] We do not observe large enough variance in our training runs to justify the computation cost incurred by multiple training runs. For the largest models, it is not feasible within our compute budget.
Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Details can be found in Appendix B.1.2. In short, our largest run was trained on 1536 TPU chips for 15 days.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
If your work uses existing assets, did you cite the creators? [Yes] We properly cited the prior methods on which our work is based, as well as prior datasets when appropriate (e.g., ALIGN).
Did you mention the license of the assets? [N/A] The assets we used are previous work for which we cited papers. We do mention the license of all visual assets we use for the figures of the paper in Appendix G.
Did you include any new assets either in the supplemental material or as a URL? [No]
Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes] Our data was automatically scraped from millions of webpages. See Datasheets in Appendix F.
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] See Datasheets in Appendix F.
If you used crowdsourcing or conducted research with human subjects…
Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]
Appendix
We provide an overview of the Appendix below.
Method (Appendix A). We first provide additional details about our model in Appendix A.1:
An illustration and pseudo-code for the Perceiver Resampler (described in Section 2.1) is provided in Appendix A.1.1 and Figure 5.
A similar illustration is provided for the gated xattn-dense layer of Section 2.2 in Appendix A.1.2 and Figure 4.
Details on our implementation of the multi-image/video attention mechanism (Section 2.3) are given in Appendix A.1.3.
Hyperparameters for all model architectures are given in Appendix A.1.4.
We then explain how we evaluate our models using in-context few-shot learning in Appendix A.2. This includes details on how we build the few-shot prompt, how we get predictions for open- and close-ended tasks, how we obtain the zero-shot numbers, and how we leverage retrieval and ensembling to take advantage of more annotated examples.
Finally, in Appendix A.3, we provide more details on our training datasets:
How we process M3W samples during training in Appendix A.3.2,
Collection of LTIP and VTP in Appendix A.3.3,
Deduplication strategy we employ to ensure that there is no leakage between our training and evaluation datasets in Appendix A.3.4.
Experiments (Appendix B). We first provide additional training and evaluation details in Appendix B.1, including:
Details on Flamingo-3B, Flamingo-9B and Flamingo in Appendix B.1.1,
The training hyperparameters in Appendix B.1.2,
More details on the Contrastive model pretraining in Appendix B.1.3,
Details on our evaluation benchmarks and splits in Appendix B.1.4,
A discussion on the few-shot learning hyperparameters in Appendix B.1.5,
The dialogue prompt used in the qualitative dialogue examples shown in Figure 1 and Figure 11 in Appendix B.1.6.
Next, we give additional results obtained by our models in Appendix B.2 including the performance of the Flamingo models on classification tasks in Appendix B.2.1, detailed fine-tuning results in Appendix B.2.2, and zero-shot results from our contrastive models (Appendix B.2.3).
Finally, we provide more ablation studies in Appendix B.3 for both the Flamingo models (Appendix B.3.1) and our contrastive pretrained Visual Encoders (Appendix B.3.2).
Qualitative results (Appendix C). More qualitative results are given in Appendix C: Figure 10 (single image sample), Figure 11 (dialogue examples), and Figure 12 (video examples).
Discussion (Appendix D). We provide a more complete discussion on our work, including limitations, failure cases, broader impacts and societal impacts of our work in Appendix D.
Model card (Appendix E). The Flamingo model card is provided in Appendix E.
Datasheets (Appendix F). Datasheets for M3W, LTIP and VTP are respectively given in Appendix F.1, Appendix F.2.1 and Appendix F.2.2.
Credit for visual content (Appendix G). We provide attribution for all visual illustrations used in the paper in Appendix G.
Appendix A Method
Expanding on our brief description in Section 2.1, Figure 5 provides an illustration of our Perceiver Resampler processing an example video, together with pseudo-code. Our Perceiver Resampler is similar in spirit to the Perceiver models proposed by Jaegle et al. We learn a predefined number of latent input queries, and cross-attend to the flattened visual features. These visual features are obtained by first adding a learnt temporal position encoding to each feature within a given video frame (an image being considered as a single-frame video). Note that we only use temporal encodings and no explicit spatial grid position encodings; we did not observe improvements from the latter. The likely rationale is that CNNs, such as our NFNet encoder, are known to implicitly include spatial information channel-wise. The visual features are then flattened and concatenated as illustrated in Figure 5. The number of output tokens of the Perceiver Resampler is equal to the number of learnt latent queries. Unlike in DETR and Perceiver, the keys and values computed from the learnt latents are concatenated to the keys and values obtained from the flattened visual features, which we found to perform slightly better.
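As a rough illustration of the two steps described above (adding a temporal encoding per frame, then letting a fixed set of latents attend to the visual features concatenated with the latents themselves), the following pure-Python sketch may help. It is not the pseudo-code of Figure 5: it assumes a single attention head with identity projections and omits the residual connections and feed-forward layers, and the names `flatten_with_time` and `resampler_attention` are hypothetical.

```python
import math


def flatten_with_time(frames, time_emb):
    """frames[t] is a list of feature vectors for frame t.

    Adds the temporal encoding time_emb[t] to every feature of frame t,
    then flattens all frames into a single list of feature vectors.
    """
    return [
        [f + e for f, e in zip(feat, time_emb[t])]
        for t, frame in enumerate(frames)
        for feat in frame
    ]


def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def resampler_attention(latents, visual_feats):
    """One simplified attention step (assumption: identity projections).

    Keys and values are the visual features concatenated with the latents,
    so the output always has exactly len(latents) tokens.
    """
    dim = len(latents[0])
    keys_values = visual_feats + latents
    outputs = []
    for query in latents:
        scores = [
            sum(a * b for a, b in zip(query, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = _softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return outputs


# Toy usage: a two-frame "video" with 2-dimensional features.
latents = [[1.0, 0.0], [0.0, 1.0]]
frames = [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0]]]
time_emb = [[0.1, 0.0], [0.0, 0.1]]
flat = flatten_with_time(frames, time_emb)
tokens = resampler_attention(latents, flat)
```

Whatever the number of frames or spatial positions, the number of output tokens equals the number of latents, which is the property the paragraph above relies on.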
A.1.2 gated xattn-dense details
We provide in Figure 4 an illustration of a gated xattn-dense block and how it connects to a frozen LM block, together with pseudo-code.
We also plot in Figure 6 the evolution of the absolute value of the gating values as a function of training progress at different layers of the LM stack for the Flamingo-3B model, which is composed of 24 LM layers. All layers of the frozen LM stack seem to utilize the visual information, as the gating values quickly grow in absolute value from their 0 initializations. We also note that the absolute values seem to grow with the depth. However, it is difficult to draw strong conclusions from this observation: the scale of the activations before gating may also vary with depth. Future work is required to better understand the effect of these added layers on the optimization dynamics and on the model itself.
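To make the role of the zero-initialized gates concrete, here is a minimal sketch of a gated residual update. It assumes the tanh gating described in the main text and treats the cross-attention output and feed-forward function as given inputs; the function names are hypothetical and this is not the pseudo-code of Figure 4.

```python
import math


def gated_residual(y, update, alpha):
    """Return y + tanh(alpha) * update, element-wise."""
    gate = math.tanh(alpha)
    return [yi + gate * ui for yi, ui in zip(y, update)]


def gated_xattn_dense(y, xattn_out, ffw_fn, alpha_xattn, alpha_dense):
    """Gated cross-attention update followed by a gated feed-forward update.

    xattn_out stands in for the cross-attention output over visual tokens;
    ffw_fn stands in for the dense feed-forward layer.
    """
    y = gated_residual(y, xattn_out, alpha_xattn)
    y = gated_residual(y, ffw_fn(y), alpha_dense)
    return y


# With both gates at their 0 initialization, the block is the identity,
# so the frozen LM's activations are untouched at the start of training.
y = [0.3, -1.2, 2.0]
out = gated_xattn_dense(y, [5.0, 5.0, 5.0], lambda v: [2 * x for x in v], 0.0, 0.0)
```

As the gates move away from zero, the visual update is mixed into the residual stream with weight tanh(alpha), which is the quantity whose absolute value is tracked in Figure 6.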
A.1.3 Multi-visual input support
We illustrate in Figure 7 the masking approach we use to limit the number of visual tokens that a certain text token sees. We also formalize our notation for the interleaved sequences of images/videos and text.
A.1.4 Transformer architecture
We list in Table 4 the number of layers (), the hidden dimension (), the number of heads (), and the FFW activation (Act.) used for each transformer component of our Flamingo models. The dimension of keys and values in each configuration is given by (96 for the Perceiver Resampler; 128 for gated xattn-dense and the frozen LM), and the hidden dimension of each feed-forward MLP is . Note that the frozen LM was trained with the GeLU activation , while the remaining trainable transformer layers use the Squared ReLU activation , which we found to outperform GeLU.
A.2 In-context few-shot evaluation details
We evaluate the ability of our models to rapidly adapt to new tasks using in-context learning, following an analogous approach to the one used in GPT-3. In detail, we are given a set of support examples in the form of (image, text) or (video, text) pairs (where the image or video is the input visual and the text is the expected response and any additional task-specific information, e.g., a question) and a single visual query for which we want our model to make a prediction. Given this, we build a multimodal prompt by concatenating the support examples followed by the visual query as illustrated by Figure 8. Unless specified otherwise, we choose the concatenation order at random.
In an open-ended setting, the model’s sampled text following the query image is then taken as its prediction for the image, stopping at the first
In the absence of few-shot examples, approaches commonly rely on prompt engineering to condition the model at inference using a suitable natural language description of the task. Validation of such prompts can significantly impact performance but requires access to a number of annotated examples and cannot therefore be considered truly zero-shot. Furthermore, Perez et al. have shown that such validation procedures are generally not robust with access to only a handful of samples during validation. To report zero-shot performance in our work, we instead build a prompt with two examples from the downstream tasks where we remove their corresponding images or videos. For example, for the task illustrated at the top of Figure 8, the prompt would be “
When the size of the support set exceeds a certain limit, it can become difficult to leverage all the examples with in-context learning: first because it becomes excessively expensive to fit all the examples in the prompt, and second because there is a risk of poor generalization when the prompt size exceeds the size of the sequence used during training . In such situations, it is appealing to use a form of prompt selection to both limit the sequence length as well as potentially improve the prompt quality which can in turn lead to better performance . In particular, we follow the Retrieval-based In-Context Example Selection (RICES) approach introduced by . In detail, given a query image, we retrieve similar images in the support set by comparing the visual features extracted from our frozen pretrained visual encoder. We then build the prompt by concatenating the top- most similar examples. Since LMs are sensitive to the ordering in the prompt due to recency bias , we order the examples by increasing order of similarity, such that the most similar support example appears right before the query. We notably show the effectiveness of this approach in classification settings with multiple hundreds of classes (see Appendix B.2.1) where we are given one or more images/videos per class, yielding a number of examples that would not otherwise fit in the prompt.
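The retrieval and ordering steps of RICES described above can be sketched as follows. This is only an illustration: it assumes cosine similarity as the comparison between visual features (the text only says the features are compared), and `rices_select` and the toy features are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rices_select(query_feat, support, k):
    """Pick the k support examples most similar to the query.

    support is a list of (visual_feature, example) pairs. The result is
    ordered by increasing similarity, so the most similar example comes
    last, i.e. right before the query in the prompt.
    """
    ranked = sorted(support, key=lambda item: cosine(query_feat, item[0]), reverse=True)
    top_k = ranked[:k]
    top_k.reverse()
    return [example for _, example in top_k]


support = [
    ([1.0, 0.0], "a"),
    ([0.0, 1.0], "b"),
    ([1.0, 1.0], "c"),
    ([-1.0, 0.0], "d"),
]
selected = rices_select([1.0, 0.0], support, k=2)
```

Here `selected` places "c" before "a", since "a" is the closest match to the query and should sit nearest to it in the prompt.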
We also explore ensembling the outputs of the model across multiple prompts in the close-ended setting. This can notably be combined with RICES where ensembling can be done over multiple permutations of the ranked nearest neighbors. Specifically, for a given answer, we average the log likelihoods estimated by the model over 6 random permutations of the selected few-shot examples.
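A minimal sketch of this ensembling scheme is given below. The scoring function `score_fn(shots, query, answer)` is a hypothetical stand-in for the model's log-likelihood of an answer given a prompt; only the averaging over 6 random permutations follows the description above.

```python
import random


def ensemble_score(score_fn, shots, query, answer, num_perms=6, seed=0):
    """Average the log-likelihood of `answer` over random orderings of the shots."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_perms):
        permuted = list(shots)
        rng.shuffle(permuted)
        total += score_fn(permuted, query, answer)
    return total / num_perms


def predict(score_fn, shots, query, candidates, num_perms=6, seed=0):
    """Close-ended prediction: the candidate with the highest averaged score."""
    return max(
        candidates,
        key=lambda ans: ensemble_score(score_fn, shots, query, ans, num_perms, seed),
    )


# Toy scorer (assumption): fixed log-likelihoods, independent of the prompt order.
calls = []


def toy_score(shots, query, answer):
    calls.append(tuple(shots))
    return {"cat": -1.0, "dog": -2.0}[answer]


avg = ensemble_score(toy_score, ["s1", "s2", "s3"], "query", "cat")
```

With a real model the score would vary with the ordering of the shots, and the average smooths out that sensitivity.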
A.3 Training dataset details
We train the Flamingo models on a carefully chosen mixture of datasets illustrated in Figure 9 and described next.
The selection and scraping of web pages for M3W follows a similar process to the one used for collecting the MassiveWeb dataset . We start by filtering out non-English documents. We also remove those that do not pass internal filters, which identify explicit content across images, videos, and text. We use a custom scraper to extract salient content from the remaining documents, in the form of plain text interleaved with images, as described in Section 2.4. The text in M3W is collected in a similar fashion to that of MassiveWeb, but we also collect any images present at the same level in the HTML tree. We discard documents for which the scraping process does not yield any images.
We then apply similar text filtering heuristics, to remove low quality documents and reduce repetition, as well as some image filters to remove images that are too small (either width or height less than 64 pixels), too wide or narrow (aspect ratio greater than 3 in either direction), or unambiguously low quality (e.g. single-colour images). We discard documents that no longer contain any images following this filtering step.
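The size and aspect-ratio filters above can be written down directly. The sketch below covers only those two rules (the low-quality check, e.g. single-colour detection, is left out), and the function names are hypothetical.

```python
def keep_image(width, height, min_side=64, max_aspect=3.0):
    """Return True if the image passes the size and aspect-ratio filters."""
    if width < min_side or height < min_side:
        return False  # too small
    if max(width, height) / min(width, height) > max_aspect:
        return False  # too wide or too narrow
    return True


def filter_document(image_sizes):
    """Keep the images that pass; return None if the document has none left."""
    kept = [size for size in image_sizes if keep_image(*size)]
    return kept or None


doc = [(640, 480), (32, 32), (1200, 100)]
kept = filter_document(doc)
```

A document for which `filter_document` returns `None` corresponds to one that would be discarded after this filtering step.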
A.3.2 M3W image-placement augmentation
During evaluation of Flamingo models, we prompt the model with an image and ask it to generate text for that image. This lends itself to a natural sequencing at inference time in which the image comes before the corresponding text output.
However, the correspondence between images and text in our interleaved M3W dataset (Section 2.4) is in general unknown (and potentially not well-defined in certain cases). As a motivating example, a simple webpage might be structured in either of the following ways:
This is my dog!
The text-aligned image indices (indices) might “ideally” be chosen such that at each point in the text, the index points to the most semantically relevant image for that text – i.e., the next image in example (a), and the previous image in example (b). In the absence of a general way to determine semantic correspondence between text and images on webpages “in the wild”, we make a simplifying assumption that the most relevant image at any given point in the text is either the last image appearing before the text token, or the image immediately following it (as in the simple examples above), and choose indices accordingly.
During training, for each webpage sampled, we sample with probability whether indices are chosen to map text to the previous or next image. This inevitably means we make the semantically “unnatural” choice – e.g., associating the text “This is my cat!” with the dog image in (a) above – around half of the time. We ablate this choice in Section 3.3, finding a small advantage to setting over either (always the previous image index) or (always the next image index). This suggests that there may be a beneficial “data augmentation” effect to this randomisation.
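The index selection described above can be sketched as follows. The marker `[IMG]`, the function names, and the use of `None` when no previous or next image exists are assumptions made for illustration; the default probability of 0.5 reflects the "around half of the time" behaviour described in the text.

```python
import random

IMAGE = "[IMG]"  # hypothetical marker for an image position in the sequence


def image_indices(tokens, use_next):
    """Map each text chunk to the index of the previous or the next image."""
    num_images = tokens.count(IMAGE)
    indices = []
    seen = 0  # number of images encountered so far
    for tok in tokens:
        if tok == IMAGE:
            seen += 1
            continue
        idx = seen if use_next else seen - 1
        indices.append(idx if 0 <= idx < num_images else None)
    return indices


def sample_image_indices(tokens, p_next=0.5, rng=random):
    """Per webpage, choose 'next' with probability p_next, else 'previous'."""
    return image_indices(tokens, use_next=rng.random() < p_next)


# Layout where each text precedes its image.
page_a = ["This is my dog!", IMAGE, "This is my cat!", IMAGE]
# Layout where each image precedes its text.
page_b = [IMAGE, "This is my dog!", IMAGE, "This is my cat!"]
```

For `page_a`, choosing the previous image maps "This is my cat!" to index 0 (the dog image), which is the semantically unnatural pairing discussed above; choosing the next image gives the natural pairing.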
A.3.3 LTIP and VTP: Visual data paired with text
Along with our interleaved image and text dataset, we use several paired vision and text web datasets for training. One dataset is ALIGN , composed of 1.8 billion images paired with alt-text. ALIGN is large, but noisy and limited to images. The images are often poorly described by the corresponding alt-text annotation. For this reason, we augment it with two datasets: LTIP (Long Text & Image Pairs) consists of 312 million images, and VTP (Video & Text Pairs) consists of 27 million short videos (approximately 22 seconds on average). Both datasets are paired with more descriptive captions. For instance, the average number of tokens of an ALIGN text description is 12.4 per image, while it is 20.5 for the LTIP dataset. The LTIP and VTP datasets were collected by crawling fewer than ten websites targeting high-quality and rich image descriptions. These single-image and single-video datasets are preprocessed analogously to the M3W data preprocessing described previously, adding the
A.3.4 Dataset deduplication against evaluation tasks
We used an internal deduplication tool to deduplicate our training datasets against our evaluation datasets. This deduplication pipeline relies on a trained visual encoder which maps images to embeddings that lie closer together when the images are potential duplicates. Once the image embeddings have been computed, a fast approximate nearest neighbor search is performed on the training images to retrieve duplicate candidates from the validation datasets. For the paired image-text datasets, we have deduplicated our LTIP and ALIGN training images against: ImageNet (train, val), COCO (train, valid, test), OK-VQA (train, valid, test), VQAv2 (train, valid, test), Flickr30k (train, valid, test), VisDial (train, valid, test).
We did not deduplicate our image datasets against VizWiz, HatefulMemes and TextVQA as we performed these evaluations only after having trained our Flamingo models. However, we believe this had no impact on our results as the images from these datasets are unlikely to be scraped from the web: VizWiz images were obtained using a specific mobile app and are only available for download, HatefulMemes memes were created by researchers instead of being scraped from the web, and TextVQA images are from OpenImages.
Note that we did not run the deduplication on the M3W dataset as one training example is a full webpage of interleaved paragraphs and images, which is unlikely to contain images from our benchmark suite. To verify this hypothesis, we obtained near-duplicate statistics on the 185M individual images from M3W: in total, 1314 potential duplicates were found from the validation and test splits of ImageNet, COCO, OK-VQA, VQAv2, Flickr30k and VisDial. Out of the 1314 candidates, only 125 are exact duplicates.
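The idea of retrieving duplicate candidates by nearest-neighbour search in embedding space can be illustrated with a small sketch. It is not the internal tool: it replaces the approximate search with an exact brute-force search, assumes cosine similarity, and uses a hypothetical threshold of 0.95.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def duplicate_candidates(train_embs, eval_embs, threshold=0.95):
    """Return (train_idx, eval_idx, similarity) for likely duplicates.

    For each training embedding, find its nearest evaluation embedding
    (exact search here) and keep the pair if the similarity is high enough.
    """
    train_n = [_normalize(v) for v in train_embs]
    eval_n = [_normalize(v) for v in eval_embs]
    candidates = []
    for i, t in enumerate(train_n):
        sims = [sum(a * b for a, b in zip(t, e)) for e in eval_n]
        best_j = max(range(len(sims)), key=sims.__getitem__)
        if sims[best_j] >= threshold:
            candidates.append((i, best_j, sims[best_j]))
    return candidates


train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
evaluation = [[2.0, 0.01], [-1.0, 0.0]]
found = duplicate_candidates(train, evaluation)
```

Candidates found this way would still need a further check to separate exact duplicates from near-duplicates.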
For the video datasets, we did not perform any deduplication of VTP (27M videos) as none of the collected VTP videos were obtained from YouTube or Flickr, which are the sources of all of our video evaluation datasets collected on the Internet.
Appendix B Experiments
We perform experiments across three model sizes, where we scale the frozen language model from 1.4B to 7B and 70B; and adapt the parameter count of other components accordingly. We keep the pretrained vision encoder frozen across all experiments and use a NFNet-F6 model trained contrastively (see Appendix B.1.3), unless explicitly stated otherwise in the ablation study. We use a Perceiver Resampler with approximately 200M parameters across all three model sizes.
The decision on how many gated xattn-dense layers to interleave is mainly driven by a trade-off between memory constraints and downstream performance. We identified the optimal trade-off at small model scales, before transferring our findings to the large model architecture.
We obtain three models, Flamingo-3B, Flamingo-9B and Flamingo-80B, detailed below:
The Flamingo-3B model builds on top of a 1.4B frozen language model from . Before each transformer block, we add a gated xattn-dense layer attending to the visual inputs; this accounts for 1.4B additional learned parameters.
The Flamingo-9B model builds on top of a 7B frozen language model from . Starting from the very first layer and before every fourth transformer block, we add a gated xattn-dense layer attending to the visual inputs; this accounts for 1.8B additional learned parameters.
The Flamingo-80B model builds on top of the frozen Chinchilla 70B language model. Starting from the very first layer and before every seventh transformer block, we add a gated xattn-dense layer attending to the visual inputs; this accounts for 10B additional learned parameters. For simplicity, we refer to this model as Flamingo throughout the paper.
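The interleaving pattern of the three models (a gated xattn-dense layer before every block, every fourth block, or every seventh block, starting from the first) can be expressed with a small helper. The helper name is hypothetical; the 24-layer depth is the Flamingo-3B value mentioned in Appendix A.1.2, while the other depths in the example are illustrative only.

```python
def xattn_positions(num_lm_layers, every):
    """Indices of the LM blocks that are preceded by a gated xattn-dense layer."""
    return list(range(0, num_lm_layers, every))


# Flamingo-3B: 24 LM layers, one gated xattn-dense layer before each block.
positions_3b = xattn_positions(24, every=1)

# Illustrative depths (not taken from the paper) for the sparser patterns.
positions_every_4 = xattn_positions(8, every=4)
positions_every_7 = xattn_positions(15, every=7)
```

Inserting fewer layers reduces the number of added parameters and the memory footprint, which is the trade-off described above.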
In Table 5 we report the parameter count of each component of our models, as well as model sharding requirements. We provide more Transformer architecture details in Appendix A.1.4. The Flamingo model card is also given in Appendix E.
B.1.2 Training details for the Flamingo models
Empirically we find that it is effective to stochastically prepend the paired dataset text samples with a single space character, with probability 0.5. We attribute this to the fact that our subword tokenizer maps the beginning of various words to a different token depending on whether it is preceded by a space. This allows us to enforce invariance to this tokenizer artifact, without significantly degrading the correctness of the punctuation, which is already lacking in many of these samples. We observe that this leads to substantial improvement across tasks.
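This augmentation amounts to a one-line transformation of each caption. A minimal sketch, with a hypothetical function name and an injectable random source for reproducibility:

```python
import random


def maybe_prepend_space(text, p=0.5, rng=random):
    """With probability p, prepend a single space to the text sample."""
    return " " + text if rng.random() < p else text


sample = maybe_prepend_space("a cat sitting on a sofa", rng=random.Random(0))
```

Because the same caption is seen both with and without the leading space over the course of training, the model is exposed to both tokenizations of its first word.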
The visual inputs are resized to while preserving their aspect ratios, padding the image with the mean value if required. Note that this is higher than the resolution used for the contrastive pretraining of our Vision Encoder (see Appendix B.1.3). The increase in resolution during the final stage of training was motivated by prior work showing that one can obtain improved performance at a higher test-time resolution when using CNNs. This increase in resolution also comes with only a moderate computational and memory cost as no backpropagation is performed through the frozen Vision Encoder. We also employ random left/right flips and color augmentation.
For interleaved datasets (Section 2.4) we also employ augmentation by lightly randomizing the selected image indices with a hyperparameter when sampling examples from the M3W dataset. This augmentation is detailed in Appendix A.3.2 and our choice of is ablated in Appendix B.3.1. For video training, we temporally sample a clip of 8 frames sampled at one frame per second (fps) from each training video. Although our model was trained with a fixed number of 8 frames, at inference time, we input 30 frames at 3 FPS. This is achieved by linearly interpolating the learnt temporal position embedding of the Perceiver Resampler at inference time.
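The interpolation of the temporal position embeddings from 8 training frames to 30 inference frames can be sketched as below. The endpoint alignment (first and last embeddings mapped to the first and last frames) is an assumption, as is the function name; the paper only states that the interpolation is linear.

```python
def interpolate_embeddings(emb, new_len):
    """Linearly interpolate a list of embedding vectors to new_len positions."""
    old_len = len(emb)
    if old_len == 1 or new_len == 1:
        return [list(emb[0]) for _ in range(new_len)]
    out = []
    for i in range(new_len):
        # Position of the i-th new frame on the original [0, old_len - 1] axis.
        pos = i * (old_len - 1) / (new_len - 1)
        lo = min(int(pos), old_len - 2)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(emb[lo], emb[lo + 1])])
    return out


# Toy 1-dimensional embeddings for 8 training frames, stretched to 30 frames.
trained = [[float(t)] for t in range(8)]
stretched = interpolate_embeddings(trained, 30)
```

The learnt parameters are unchanged; only the positions at which they are read out are resampled at inference time.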
All our models are trained using the AdamW optimizer with global norm clipping of , no weight decay for the Perceiver Resampler and weight decay of 0.1 for the other trainable parameters. The learning rate is increased linearly from to up over the first 5000 steps then held constant for the duration of training (no improvements were observed from decaying the learning rate). Unless specified otherwise we train our models for steps. Four datasets are used for training: M3W, ALIGN, LTIP and VTP with weights of , , and respectively. These weights were obtained empirically at a small model scale and kept fixed afterwards. Batch sizes depend on the setting and are given in the next sections.
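The learning-rate schedule described above (a linear warm-up over the first 5000 steps followed by a constant rate) can be sketched as a simple function. The initial and peak values are left as parameters, and the values used in the example are placeholders rather than the paper's settings.

```python
def learning_rate(step, init_lr, peak_lr, warmup_steps=5000):
    """Linear warm-up from init_lr to peak_lr, then constant at peak_lr."""
    if step >= warmup_steps:
        return peak_lr
    frac = step / warmup_steps
    return init_lr + frac * (peak_lr - init_lr)


# Placeholder values, chosen only to illustrate the shape of the schedule.
halfway = learning_rate(2500, init_lr=0.0, peak_lr=1e-4)
after_warmup = learning_rate(100_000, init_lr=0.0, peak_lr=1e-4)
```

No decay phase appears in the function because, as noted above, decaying the learning rate brought no improvement.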
Our model and associated infrastructure were implemented using JAX and Haiku . All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on chips for 15 days and sharded across 16 devices. Megatron type sharding is used to enable 16-way model parallelism for all Embedding / Self-Attention / Cross-Attention / FFW layers, while the NFNet vision layers were unsharded. ZeRO stage 1 is used to shard the optimizer state. All trained parameters and optimizer accumulators are stored and updated in float32; all activations and gradients are computed in bfloat16 after downcasting of parameters from float32 to bfloat16. Frozen parameters are stored and applied in bfloat16.
B.1.3 Contrastive model details
The vision encoder is trained from scratch, together with a language encoder. Using these encoders, images and text pairs are separately encoded and projected to a shared embedding space and L2 normalized. From these embeddings, we maximize the similarity of paired embeddings and minimize the similarity of unpaired embeddings, using a multi-class cross-entropy loss, where the paired image-texts are treated as positive examples and the rest of the batch as negative examples. We use the same loss as in CLIP , which consists of two contrastive losses, one from text to image and the other from image to text. We use a learnable temperature parameter in the final log-softmax layer . The text-to-image loss is as follows:
And the image-to-text loss is defined analogously:
The sum of the two losses is minimized. Here, and are, respectively, the normalized embedding of the vision and language component of the -th element of a batch. is a trainable inverse temperature parameter and is the number of elements in the batch. We use the BERT architecture for the language encoder. The outputs of the language and vision encoders are mean-pooled (across tokens and spatial locations, respectively) before being projected to the shared embedding space. We only use the weights from the contrastive vision encoder in the main Flamingo models.
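For readers who prefer code to formulas, the following sketch computes a symmetric contrastive loss of the kind described above: a cross-entropy over the batch in each direction, with the paired element as the positive, scaled by an inverse temperature. It is written from the textual description, so the argument names (`vision`, `text`, `beta`) and the averaging over the batch are assumptions; label smoothing is omitted.

```python
import math


def _diag_log_softmax(logits):
    """For each row i, return log softmax(logits[i]) evaluated at column i."""
    values = []
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        values.append(row[i] - log_sum_exp)
    return values


def contrastive_loss(vision, text, beta):
    """Sum of the text-to-image and image-to-text contrastive losses.

    vision[i] and text[i] are the L2-normalized embeddings of the i-th pair;
    beta is the (trainable) inverse temperature.
    """
    n = len(vision)
    # t2i[i][j] = beta * <text_i, vision_j>
    t2i = [[beta * sum(a * b for a, b in zip(l, v)) for v in vision] for l in text]
    i2t = [list(col) for col in zip(*t2i)]
    loss_t2i = -sum(_diag_log_softmax(t2i)) / n
    loss_i2t = -sum(_diag_log_softmax(i2t)) / n
    return loss_t2i + loss_i2t


vision = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
uninformative = contrastive_loss(vision, text, beta=0.0)
well_aligned = contrastive_loss(vision, text, beta=50.0)
```

With `beta = 0` every pair is equally likely and the loss equals twice the log of the batch size; with well-aligned embeddings and a large `beta` it approaches zero.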
The vision encoder is pretrained on the ALIGN and LTIP datasets. The training image resolution is , the joint embedding space is size and the batch size is 16,384. It is trained for million parameter update steps, each of which consist of two gradient calculation steps (more details below) on 512 TPUv4 chips. The learning rate is decayed linearly from to zero over the course of training. Images have random color augmentation and horizontal flips applied during training. We use the tokenizer employed by Jia et al. . The Adam optimizer is used to optimize the network, and we apply label smoothing of . We apply adaptive gradient clipping (AGC) to the NFNet encoder and global norm gradient clipping of 10 for the BERT encoder.
To evaluate the pretrained model, we track zero-shot image classification and retrieval. For zero-shot image classification, we use image-text retrieval between the images and the class names. Following Radford et al., we use “prompt-ensembling”, in which we embed multiple texts using templates such as “A photo of a {class_name}” and average the resulting embeddings.
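Prompt-ensembling for zero-shot classification can be sketched as follows. The text encoder is passed in as a hypothetical `embed_text` function, the second template is made up for illustration, and normalizing before and after averaging is an assumption.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def class_embedding(class_name, templates, embed_text):
    """Average the normalized embeddings of all templates filled with the class name."""
    embs = [_normalize(embed_text(t.format(class_name=class_name))) for t in templates]
    mean = [sum(e[d] for e in embs) / len(embs) for d in range(len(embs[0]))]
    return _normalize(mean)


def classify(image_emb, class_names, templates, embed_text):
    """Return the class whose ensembled text embedding best matches the image."""
    image_emb = _normalize(image_emb)

    def score(name):
        class_emb = class_embedding(name, templates, embed_text)
        return sum(a * b for a, b in zip(image_emb, class_emb))

    return max(class_names, key=score)


templates = ["A photo of a {class_name}", "A picture of a {class_name}"]


def toy_embed(text):
    # Stand-in encoder (assumption): separates "dog" texts from everything else.
    return [1.0, 0.0] if "dog" in text else [0.0, 1.0]


label = classify([0.9, 0.1], ["dog", "cat"], templates, toy_embed)
```

In practice `embed_text` would be the contrastively trained language encoder and `image_emb` the output of the vision encoder.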
B.1.4 Evaluation benchmarks
Our goal is to develop models that can rapidly adapt to diverse and challenging tasks in the few-shot setting. For this, we consider a wide array of popular image and video benchmarks summarized in Table 6. In total we chose 16 multimodal image/video and language benchmarks, spanning tasks that require some language understanding (visual question answering, captioning, visual dialogue) as well as two standard image and video classification benchmarks (ImageNet and Kinetics). Note that for the video datasets collected from YouTube (i.e., all video datasets except NextQA and STAR), we evaluated our model on all the publicly available videos as of April 2022.
In order to validate design decisions of our model over the course of the project, we selected five benchmarks from the multimodal image/video and language benchmarks as well as ImageNet and Kinetics for classification as our development set (referred to as dev). To maximise its relevance, we choose the most challenging and widely studied benchmarks for captioning, visual question-answering and classification tasks on both images and videos.
Concretely, estimating few-shot learning performance of a model consists of adapting it on a set of support samples and evaluating it on a set of query samples. As a result, any evaluation set should be composed of two disjoint subsets containing respectively the support and the query samples. For the dev benchmarks that are used both to validate design decisions and hyperparameters, as well as to report final performance, we therefore use four subsets:
validation support: contains support samples for validation;
validation query: contains query samples for validation;
test support: contains support samples for final performance estimation;
test query: contains query samples for final performance estimation.
In practice, for the test query subset, we use the subset that prior works report results on, for apples-to-apples comparison. While the validation set would be a natural choice for the validation query subset, we note that this is not possible for all benchmarks, since some benchmarks do not have an official validation set (e.g. OKVQA) and for others, the validation is commonly used to report final performance in place of the test set (e.g. ImageNet or COCO). For simplicity, we use a subset of the original training set as the validation query subset. Finally, we also use additional disjoint subsets of the training set as respectively the validation support subset and the test support subset.
We now describe in more detail how we form the latter three subsets. For captioning tasks, open-ended evaluation is efficient so we evaluate on a large number of samples. Specifically, for COCO, we use the same number of samples as used in the Karpathy splits for evaluation sets (5000). For VATEX, because the training set is of limited size, we only evaluate over 1024 samples, reserving the rest for support sets. For question-answering tasks, we evaluate over 1024 samples; chosen to make both open- and close-ended evaluation reasonably fast. For image classification tasks, we evaluate over 10 images per class: 10,000 samples for ImageNet, and 7000 samples for Kinetics700. As for the support sets, for both validation and final performance estimation, we use 2048 samples across all tasks, except for classification tasks where we scale this to 32 samples per class, to better estimate expected performance for each class.
Few-shot learning performance estimates on the dev benchmarks may be biased, in the sense that over the course of this project, design decisions were made based on the performance obtained on these benchmarks. We note that this is also the case for prior work, which makes use of these benchmarks to validate and ablate its own design decisions. To account for this bias and provide unbiased few-shot learning performance estimates, we report performance on a remaining set of 11 benchmarks. Among those, some span the same open-ended image and video tasks as our dev benchmarks (captioning and visual question-answering). But we also look at more specific benchmarks in order to probe less explored capabilities. These notably include: TextVQA, which specifically assesses OCR capabilities through question-answering; VisDial, a visual dialogue benchmark; HatefulMemes, a vision and text classification benchmark; NextQA, which specifically focuses on causality and temporal relation; STAR, a multiple-choice question answering task; and RareAct, a benchmark measuring compositionality in action recognition. We emphasize that we do not validate any design decisions on these benchmarks and use them solely to estimate unbiased few-shot learning performance after Flamingo training is done.
B.1.5 Few-shot learning evaluation hyperparameters
In few-shot learning, hyperparameter selection implicitly increases the number of shots as it requires additional validation examples. If those are not taken into account, as is often the case in practice, few-shot performance can be overestimated. Similarly, cross-validation of benchmark-specific hyperparameters such as the prompt should be considered as a particularly basic few-shot learning method, where one selects the task-specific prompt over the set of shots. But other learning approaches might be more effective in making use of these labelled examples. Given the negative results reported by Perez et al. on the robustness of cross-validation, and unless mentioned otherwise, all benchmarks are run using a single set of evaluation hyperparameters, including the prompts. We optimize hyperparameters jointly across the validation subsets of the dev benchmarks and do not perform any benchmark-specific cross-validation of hyperparameters, aside from a few exceptions, as we detail next.
Except for HatefulMemes and RareAct, we always use the prompt “Output: {output}” for all non-question-answering tasks, and “Question: {question} Answer: {answer}” for all question-answering / visual dialogue tasks. In particular, for VisDial, we use the previously described prompt to encode each question/answer pair in the dialogue, and the provided image caption is prepended to the dialogue history without any prompt. For HatefulMemes, we use a specific prompt to incorporate the OCR information provided as input, which is: “is an image with written: "{meme_text}" on it. Is it hateful? Answer: {answer}”, where the answer is either yes or no. Note that this is the only dataset where we explicitly provide OCR text “meme_text” as input to Flamingo models. For TextVQA, we do not make use of the provided OCR transcripts and instead directly rely on the off-the-shelf OCR capabilities of the Flamingo models. For RareAct, a zero-shot benchmark, we change the verb names to the third person, add an article before each noun and use the prompt “Caption: a person {verb + object}”.
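As an illustration of how the generic templates above could be filled, the sketch below assembles the text part of a few-shot prompt. It deliberately leaves out the visual inputs and any special tokens, and the helper names, the space separator, and the empty answer slot for the query are assumptions.

```python
CAPTION_TEMPLATE = "Output: {output}"
QA_TEMPLATE = "Question: {question} Answer: {answer}"


def build_caption_prompt(shot_outputs):
    """Text of a captioning prompt: filled shots followed by an open 'Output:' slot."""
    parts = [CAPTION_TEMPLATE.format(output=o) for o in shot_outputs]
    parts.append(CAPTION_TEMPLATE.format(output="").rstrip())
    return " ".join(parts)


def build_qa_prompt(shots, query_question):
    """Text of a question-answering prompt; shots are (question, answer) pairs."""
    parts = [QA_TEMPLATE.format(question=q, answer=a) for q, a in shots]
    parts.append(QA_TEMPLATE.format(question=query_question, answer="").rstrip())
    return " ".join(parts)


qa_prompt = build_qa_prompt([("What is this?", "a cat")], "What colour is it?")
caption_prompt = build_caption_prompt(["A dog on a beach."])
```

The model's continuation after the final open slot is then taken as its prediction for the query.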
B.1.6 Dialogue prompt
For the dialogue examples in Figures 1 and 11, the following prompt is used:
This is a conversation between a human, User, and an intelligent visual AI, Flamingo. User sends images, and Flamingo describes them.
Flamingo: That is a cat. It’s a tiny kitten with really cute big ears.
Flamingo: This is a picture of a group of people having dinner. They are having a great time!
User: Can you guess what are they celebrating?
Flamingo: They might be celebrating the end of a successful project or maybe a birthday?
Flamingo: This is a graph, it looks like a cumulative density function graph.
B.2 Additional performance results
We consider applying the Flamingo models to well-studied classification benchmarks like ImageNet or Kinetics700. Results are given in Table 7. First, we observe a similar pattern as in other experiments: larger models tend to perform better. Second, given that few-shot classification tasks often come with more training examples (e.g., 1000 for ImageNet with 1 example per class), using methods to scale to larger support sets is beneficial. RICES (Retrieval In-Context Example Selection described in Appendix A.2) performs substantially better than simply selecting examples randomly for inclusion in the prompt. Indeed, Flamingo achieves a improvement in ImageNet classification when selecting 16 support examples out of using RICES, compared to choosing the same number of examples randomly. Ensembling multiple prompts further boosts results. However, note that Flamingo models underperform the current dominant contrastive paradigm for classification tasks; in particular, they underperform the very contrastive model used as their vision encoder (see Appendix D.1 on Flamingo’s limitations for more details). Finally, state-of-the-art zero-shot models on ImageNet such as BASIC and LiT are particularly optimized on classification tasks as they are trained on JFT-3B, a dataset with images and labels. Improving the performance of VLMs such as Flamingo on classification tasks is an interesting direction for future work.
B.2.2 Fine-tuning Flamingo as a pretrained vision-language model
To fine-tune Flamingo models on a downstream task, we train them on data batches from the task of interest in the same format as the single-image/video datasets described in Section 2.4.
When fine-tuning Flamingo, we keep the underlying LM layers frozen and train the same Flamingo layers as during pretraining. We also increase the resolution of the input images from to . Unlike in the pretraining phase, we also fine-tune the base visual encoder, finding that this typically improves results, likely due in part to the higher input resolution.
We choose certain hyperparameters on a per-task basis by grid search on a validation subset of the training set (or on the official or standard validation set where available). These hyperparameters include the learning rate (ranging from to ) and decay schedule (exponential decay by factors of ), number of training steps, batch size (either or ), and whether visual data augmentation (color augmentation, random horizontal flips) is used.
In Table 8, we present additional results for per-task Flamingo fine-tuning. When provided access to a large-scale task-specific dataset with many thousands of examples, we find that we can improve results over our previously presented in-context few-shot learning results, setting a new state of the art on five tasks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes. For example, on VQAv2, we observe improved results at , outperforming our results achieved with 32-shot in-context learning () as well as the previous state of the art (, Yan et al. ).
Although these fine-tuning results come at high computational cost relative to the previously presented in-context few-shot learning results – among other challenges like hyperparameter tuning – they further demonstrate the power of VLM pretraining for visual understanding even in the presence of large amounts of task-specific training data.
In some cases our results likely trail the state of the art due in part to the fact that we simply optimise log-likelihood and do not make use of common task-specific metric optimisation tricks, such as CIDEr optimisation for COCO captioning, and fine-tuning on dense annotations for VisDial . For example, Murahari et al. report a relative improvement in NDCG on VisDial from such dense annotation fine-tuning.
B.2.3 Zero-shot performance of the pretrained contrastive model
A crucial part of our approach is the Vision Encoder, pretrained separately using contrastive learning and kept frozen when training Flamingo models. We report zero-shot image classification results on ImageNet and Kinetics700, and retrieval results on Flickr30K and COCO. The classification results are presented in Table 7 while the retrieval results are given in Table 9. For the retrieval tasks, our model outperforms the current state-of-the-art contrastive dual encoder approaches CLIP, ALIGN and Florence. However, we underperform the zero-shot state of the art on Kinetics700 (CLIP) and on ImageNet (BASIC). As noted earlier, BASIC is particularly optimized for classification: it is trained on the JFT-3B dataset, which has images with labels rather than captions. We have noticed that training on images paired with short text descriptions similar to labels significantly helps for ImageNet but is detrimental for retrieval benchmarks, which require capturing rich scene descriptions instead. Since our goal is to use the Vision Encoder as a feature extractor for the Flamingo models in order to capture the whole scene and not just the main object, we favor retrieval metrics over classification ones. We provide more details about the contrastive pretraining in Appendix B.1.3.
B.3 Extended ablation studies
In Table 10, we report per-task results and the Overall score (see Section 3.3) for Flamingo-3B on the validation subsets of the 5 dev multimodal benchmarks with 4 shots. We perform the ablation using batch size of for M3W, for ALIGN, for LTIP and for VTP. Models are trained for 1 million gradient steps (meaning 250,000 gradient updates for the base model, as we accumulate gradients over four datasets).
We further investigate the architectural design of the Resampler in row (i) of Table 10. We ablate the size of our Resampler with three options: Small, Medium (default value for all Flamingo models), and Large. We see that the best performance is achieved with a medium-size Resampler. Moreover, when scaled together with the frozen LM, we observed that increasing the size of the Perceiver Resampler led to unstable training. We thus made a conservative choice to keep the same medium Resampler size for all our Flamingo models.
In the interleaved image-text scenario, we ablate whether the model can only attend to the single most recent previous image, or to all the previous images (row (ii) of Table 10). We can see that the single image case leads to significantly better results ( better in the overall score). One potential explanation is that when attending to all previous images, there is no explicit way of disambiguating between different images in the cross-attention inputs. Nonetheless, recent work has shown that such disambiguation is still possible implicitly through the causal attention mechanism . We also explored more explicit ways to enable this while attending to all previous images by modifying the image tags to include an index (
Given a webpage, we don’t know in advance if the text of the page will mention the previous or the next image in the two-dimensional layout of the page DOM. For this reason, we explore a data augmentation on M3W controlled by which indicates whether a given text token attends to the previous or the next image (see more details in Appendix A.3.2). The default value means that for each webpage sampled, we decide uniformly at random whether the model attends to the previous or next image. means the model always attends to the previous image while means the model always attends to the following image. The results (row (iii) of Table 10) show that using this randomization is beneficial.
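The augmentation described above can be sketched as follows. This is a minimal illustration, not the implementation of Appendix A.3.2: the parameter name `p_next` (taken here to be the probability of attending to the next image), the position-based representation of tokens and images, and both function names are assumptions made for this example.

```python
import random

def sample_attend_next(p_next, rng):
    """Decide once per webpage whether text attends to the next image.

    p_next = 0.5 gives a uniform choice, 0.0 always picks the previous
    image and 1.0 always picks the next one (assumed parametrisation).
    """
    return rng.random() < p_next

def image_index_per_token(token_positions, image_positions, attend_next):
    """Map each text token to the index of the image it attends to.

    `image_positions` must be sorted in increasing order. A token with no
    image on the chosen side is mapped to None.
    """
    result = []
    for t in token_positions:
        if attend_next:
            candidates = [i for i, p in enumerate(image_positions) if p > t]
            result.append(candidates[0] if candidates else None)
        else:
            candidates = [i for i, p in enumerate(image_positions) if p < t]
            result.append(candidates[-1] if candidates else None)
    return result

# Hypothetical page layout: images at positions 0 and 5, text elsewhere.
tokens, images = [1, 2, 6, 7], [0, 5]
prev_map = image_index_per_token(tokens, images, attend_next=False)
next_map = image_index_per_token(tokens, images, attend_next=True)
print(prev_map)  # [0, 0, 1, 1]
print(next_map)  # [1, 1, None, None]
```

The decision is drawn once per page rather than per token, matching the description that the choice is made "for each webpage sampled".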
To measure the importance of text pretraining, we compare the performance of using a frozen decoder-only Transformer either pretrained on MassiveText (our main model) or pretrained on the C4 dataset (row (iv) of Table 10). Using the C4 dataset (which is smaller and less filtered than MassiveText) for training leads to a significant loss in performance ( overall). We note that the performance notably decreases for tasks that involve more language understanding such as visual question-answering tasks (OKVQA, VQAv2 and MSVDQA) while it remains on par for tasks that do not require as much language understanding (COCO, VATEX). This highlights the importance of pretraining the LM on a high-quality text-only dataset.
During Flamingo training, we freeze the pretrained components (Vision Encoder and LM layers) while training newly added components from scratch. In row (v) of Table 10, we ablate this freezing decision by training the Vision Encoder weights either from scratch or initialized with the contrastive vision-language task. If trained from scratch, we observe that the performance decreases by a large margin of . Starting from pretrained weights still leads to a drop in performance of while also increasing the compute cost of the training.
Another approach for preventing catastrophic forgetting is to co-train on MassiveText , the dataset that was used to pretrain the language model. Specifically, we add MassiveText to the training mixture, with a weight of (best performing after a small grid search), using a sequence length of and the exact same setting as the pretraining of Chinchilla for computing the text-only training loss. In order to co-train on MassiveText, we need to unfreeze the language model but we keep the vision encoder frozen. We perform two ablations in row (vi) of Table 10: starting from a pretrained language model (with a learning rate multiplier of of the LM weights) versus initializing from scratch (with the same learning rate everywhere). In both cases, the overall scores are worse than our baseline which starts from the language model, pretrained on MassiveText, and is kept frozen throughout training. This indicates that the strategy of freezing the language model to avoid catastrophic forgetting is beneficial. Even more importantly, freezing the LM is computationally cheaper as no gradient updates of the LM weights are required and we do not need to train on an additional dataset. This computational argument is even more relevant for our largest model, Flamingo-80B, where we freeze almost of the overall weights.
In order to provide reference numbers that are more easily reproducible using publicly available datasets and network weights we also provide two additional ablations using the CLIP ViT L-14 weights and the LAION400M dataset in rows (vii) of Table 10.
B.3.2 Dataset mixing strategies for the contrastive pretraining
One key to achieving strong results was the inclusion of our new dataset LTIP alongside ALIGN for training. Despite LTIP being smaller than ALIGN by a factor of 6, a contrastive model trained only on LTIP outperforms one trained only on ALIGN on our evaluation metrics, suggesting that dataset quality may be more important than scale in the regimes in which we operate. We also find that a model trained on both ALIGN and LTIP outperforms those trained on the two datasets individually and that how the datasets are combined is important.
To demonstrate this, we train a small model with an NFNet-F0 vision encoder, BERT-mini language encoder and batch size 2048 for 1 million gradient-calculation steps on ALIGN, LTIP and a mixture of the two. The results are presented in Table 11. It shows the results of training models on the combined datasets using three different merging regimes:
Data merged: Batches are constructed by merging examples from each dataset into one batch.
Round-robin: We alternate batches of each dataset, updating the parameters on each batch.
Accumulation: We compute a gradient on a batch from each dataset. These gradients are then weighted and summed and used to update the parameters.
Across all evaluation metrics, we find that the Accumulation method outperforms other methods of combining the datasets. Although the LTIP dataset is 5× smaller than the ALIGN dataset, this ablation study suggests that the quality of the training data can be more important than its abundance.
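To make the difference between the three merging regimes concrete, the sketch below applies each of them to one update of a toy one-parameter least-squares model. Everything here is an assumption for illustration: the model, the squared-error loss (the paper trains with a contrastive loss), the learning rate, the dataset weights, and the function names.

```python
def grad(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def step_data_merged(w, batch_a, batch_b, lr):
    """Data merged: one update on a single batch mixing both datasets."""
    return w - lr * grad(w, batch_a + batch_b)

def step_round_robin(w, batch_a, batch_b, lr):
    """Round-robin: update on a batch of A, then on a batch of B."""
    w = w - lr * grad(w, batch_a)
    return w - lr * grad(w, batch_b)

def step_accumulation(w, batch_a, batch_b, lr, weight_a=1.0, weight_b=1.0):
    """Accumulation: weighted sum of per-dataset gradients, one update."""
    g = weight_a * grad(w, batch_a) + weight_b * grad(w, batch_b)
    return w - lr * g

# Hypothetical single-example batches standing in for two datasets.
batch_a, batch_b = [(1.0, 2.0)], [(2.0, 4.0)]
w_merged = step_data_merged(0.0, batch_a, batch_b, lr=0.1)
w_robin = step_round_robin(0.0, batch_a, batch_b, lr=0.1)
w_accum = step_accumulation(0.0, batch_a, batch_b, lr=0.1)
print(w_merged, w_robin, w_accum)
```

In this toy setting the merged and accumulated gradients differ only by a scale factor. With a contrastive loss the regimes differ more, since in-batch negatives in a merged batch can come from either dataset, whereas with accumulation each dataset's batch only supplies its own negatives.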
Appendix C Qualitative results
In addition to the samples in Figure 1, in this section we provide selected samples covering different interaction modalities in Figures 10, 11, and 12. Unlike the quantitative benchmark results which use beam search with a beam width of 3 for decoding, all qualitative results presented in this section use greedy decoding for faster sampling.
Figure 10 shows the simplest form of interaction where a single image is provided followed by a text prompt either in the form of a question or the start of a caption. Even though the model is not trained specifically for the question and answer format, the capabilities of the pretrained language model allow this adaptation. In many of these examples, Flamingo can do at least one step of implicit inference. Some of the objects are not named in the prompt but their properties are queried directly. Based on its visual input, the model manages to recall the knowledge relevant to the referred object and thus produces the correct answer. Vision networks trained contrastively have been shown to learn character recognition capabilities. We observe that Flamingo preserves this capability in the full model, in some cases for text that is rather small with respect to the size of the image.
Since our model can accept inputs in the form of arbitrary sequences of visuals and language, we test its ability to hold an extended dialogue with interleaved images and text. Figure 11 shows some samples which are generated by prompting the model with a brief dialogue (Appendix B.1.6) followed by user interaction including image insertions. Even after several rounds of interaction, Flamingo can still successfully attend to the image and reply to questions that cannot be guessed from language alone. We observe that multiple images can be separately attended: simple comparisons and inferences are handled properly.
Lastly, we investigated similar capabilities with video inputs as they present some extra challenges compared to images. Figure 12 shows some selected samples. As seen in the figure, in some cases Flamingo can successfully integrate information from multiple frames (e.g., videos scanning through a scene or text) and answer questions involving temporal understanding (e.g., in the last example, with the word “after”).
Appendix D Discussion
Here, we describe some limitations and failure cases of our models, as well as opportunities for further improving our models and extending their abilities.
Although our visual language models have important advantages over contrastive models (e.g., few-shot learning and open-ended generation capabilities), their performance lags behind that of contrastive models on classification tasks. We believe this is because the contrastive training objective directly optimizes for text-image retrieval, and in practice, the evaluation procedure for classification can be thought of as a special case of image-to-text retrieval . This is not the case for the language modeling objective we use to train our visual language models and this may contribute to the observed performance gap on classification tasks. In particular, Zhao et al. have shown that language models suffer from various biases arising from the training data distribution, the set of samples used in the prompt, and their order. They also show that such issues can be mitigated with calibration techniques, provided one can assume a certain prior distribution (e.g., uniform) over the label space. This assumption doesn’t hold in general, and further research is needed to develop techniques to address these issues in the few-shot setting. More generally, seeking objectives, architectures, or evaluation procedures that could bridge the gap between these two classes of models is a promising research direction.
Our models build on powerful pretrained causal language models, and as a side effect, directly inherit their weaknesses. For instance, causal modeling of the conditioning inputs is strictly less expressive than bidirectional modeling. In this direction, recent work has shown that non-causal masked language modeling adaptation followed by multitask fine-tuning can efficiently improve the zero-shot performance of causal decoder-only language models. Furthermore, transformer-based language models tend to generalize poorly to test sequences significantly longer than the training ones. In settings where the expected text output is too long, the ability of the models to leverage enough shots for few-shot learning can be affected. For instance, for the VisDial dataset, a single shot consists of an image followed by a long dialogue composed of 21 different sentences. A sequence of 32 VisDial shots is thus composed of at least 32 × 21 = 672 sentences, which in practice means that the prompt length ranges from to tokens. This is significantly longer than the maximum sequence length () our LMs have been trained on. For this reason, we have capped our reported results on VisDial at 16 shots. On another note, while our ablations demonstrate the importance of the language model priors inherited from frozen language models, we suspect that they may play a role in occasional hallucinations and ungrounded guesses observed in open-ended dialogue settings. We provide and analyze examples of such behaviours in Figure 13. Finally, language modeling suffers from poor sample efficiency during pretraining. Mitigating this issue has the potential to greatly accelerate progress in the field, by improving turnaround of large-scale training runs and in turn increasing the feasibility of more systematic exploration of design decisions at larger scales. Further discussion on typical weaknesses observed for large LMs can be found in .
In the paper, we use in-context learning as our “go-to” few-shot learning method (see Section 2.5). This method has notable advantages over gradient-based approaches such as fine-tuning. Indeed, in-context learning requires almost no hyperparameter tuning, works reasonably well in the very low data regime (dozens of examples), and only requires inference, simplifying deployment. In contrast, gradient-based approaches require carefully tuned design choices to avoid overfitting (either by proper learning rate schedule or architecture design ) and often need more data (thousands) to work well. This motivated our focus on in-context learning; however, this approach also has drawbacks we discuss next.
Inference compute cost. The compute cost of in-context learning with transformer models scales linearly with the number of shots if one can reuse the few-shot prompt for multiple query samples (by caching the keys and values) and quadratically otherwise. In contrast, gradient-based few-shot learning approaches have constant complexity with respect to the number of shots during inference.
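The linear versus quadratic behaviour can be seen with a deliberately simplified cost model that counts only query-key pairs in causal self-attention. It ignores feed-forward layers, cross-attention to visual tokens, and the one-off cost of encoding the cached prompt; the function names and the token counts in the usage example are hypothetical.

```python
def attention_pairs_cached(num_shots, shot_len, query_len):
    """Per-query cost when the prompt's keys and values are cached.

    Each query token attends to every cached prompt token, plus the query
    tokens up to and including itself, so the cost is linear in num_shots.
    """
    prompt_len = num_shots * shot_len
    return query_len * prompt_len + query_len * (query_len + 1) // 2

def attention_pairs_uncached(num_shots, shot_len, query_len):
    """Per-query cost when the whole sequence is re-encoded each time.

    Causal attention over n tokens touches n * (n + 1) / 2 pairs, which is
    quadratic in num_shots.
    """
    n = num_shots * shot_len + query_len
    return n * (n + 1) // 2

# Hypothetical sizes: 100 tokens per shot, 10 tokens per query.
for shots in (4, 8, 16):
    print(shots,
          attention_pairs_cached(shots, 100, 10),
          attention_pairs_uncached(shots, 100, 10))
```

Doubling the number of shots from 4 to 8 roughly doubles the cached count (4055 to 8055) but nearly quadruples the uncached one (84255 to 328455).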
Prompt sensitivity. In-context learning has also been shown to be disconcertingly sensitive to various aspects of the demonstrations, such as the order of the samples or their format.
Leveraging more shots. When using in-context learning, performance plateaus rapidly as the number of few-shot samples increases beyond 32. This stands in striking contrast with typical gradient-based methods, for which the amount of correctly paired training data is a critical factor for performance. We note that RICES (Retrieval In-Context Example Selection described in Appendix A.2) effectively mitigates this issue for classification tasks (Appendix B.2.1), but still faces similar issues beyond a small number of examples per class.
Task location. Recent work on understanding what makes in-context learning effective sheds some light on a possible explanation for why more shots do not always help. In more detail, Brown et al. raise the question of whether in-context learning actually “learns” new tasks at inference time based on the provided input-output mappings, or simply recognizes and identifies tasks learned during training. On this question, the findings of Reynolds and McDonell suggest that the latter is the key driver of performance across diverse settings, and refer to it as task location. Similarly, Min et al. show that the mapping from input to output generally has limited impact on few-shot performance, as opposed to specifying the overall format of the examples. In line with these findings, we also observe non-trivial zero-shot performance using prompts without any images, which also highlights that the format of the task matters significantly. Intuitively, a handful of samples may often be enough to perform task location well, but the model may generally not be able to leverage further samples at inference time to refine its behaviour.
In summary, there is no “golden” few-shot method that would work well in all scenarios. In particular, the best choice of few-shot learning approach strongly depends on characteristics of the application, an important one being the number of annotated samples. On this point, in our work, we demonstrate that in-context learning is highly effective in the data-starved regime (32 samples or fewer). There may be opportunities to combine different methods to leverage their complementary benefits, in particular when targeting less data-constrained regimes (e.g., hundreds of samples).
Natural language is a powerful and versatile input/output interface to provide descriptions of visual tasks to the model and generate outputs or estimate conditional likelihoods over possible outputs. However, it may be a cumbersome interface for tasks that involve conditioning on or predicting more structured outputs such as bounding boxes (or their temporal and spatio-temporal counterparts); as well as making spatially (or temporally and spatio-temporally) dense predictions. Furthermore, some vision tasks, such as predicting optical flow, involve predicting in continuous space, which is not something our model is designed to handle out of the box. Finally, one may consider additional modalities besides vision that may be complementary, such as audio. All of these directions have the potential to extend the range of tasks that our models can handle; and even improve performance on the ones we focus on, thanks to synergies between the corresponding abilities.
In this work, we scale Flamingo models up to 80B parameters and provide some initial insights on their scaling behaviour across evaluation benchmarks, summarized in Figure 2. In the language space, an important line of work has focused on establishing scaling laws for language models . In the vision domain, Zhai et al. take a step in this direction. Similar efforts have yet to be made for vision-language models, including contrastive models, as well as visual language models such as the ones we propose. While language modeling scaling law research has focused on perplexity as the golden metric, we speculate that it may be more directly useful for our purposes to establish such trends in terms of aggregate downstream evaluation task performance.
D.2 Benefits, risks and mitigation strategies
A system like Flamingo offers a number of potential societal benefits, some of which we will discuss in this section. Broadly, the fact that Flamingo is capable of task generalisation makes it suitable for use cases that have not been the focus of vision research historically. Typical vision systems are trained to solve a particular problem by training on large databases of manually annotated task-specific examples, making them poorly suited for applications outside of the narrow use cases for which they were deliberately trained. On the other hand, Flamingo is trained in a minimally constrained setting, endowing it with strong few-shot task induction capabilities. As we’ve shown in our qualitative examples (Appendix C), Flamingo can also be used through a “chat”-like interface for open-ended dialogue. Such capabilities could enable non-expert end users to apply models like Flamingo even to low-resource problems for which little to no task-specific training data has been collected, and where queries might be posed in a variety of formats and writing styles. In this direction, we have shown that Flamingo achieves strong performance on the VizWiz challenge (https://vizwiz.org/), which promotes visual recognition technologies to assist visually impaired people. A dialogue interface could also promote better understanding and interpretability of visual language models. It could help highlight issues with bias, fairness, and toxicity the model may pick up on from the training data. Overall, we believe that Flamingo represents an important step towards making state-of-the-art visual recognition technology more broadly accessible and useful for many diverse applications.
From a modeling perspective, although Flamingo is computationally expensive to train, it importantly leverages pretrained frozen language models and visual encoders. We demonstrated that new modalities can be introduced into frozen models, thereby avoiding expensive retraining. As such models continue to grow in size and computational demands, “recycling” them will become increasingly important from an environmental perspective (as well as a practical one), as described in Larochelle and explored in Strubell et al. for language models. We hope such results may inspire further research into how existing models can be repurposed efficiently rather than trained from scratch.
D.2.2 Risks and mitigation strategies
This section provides some early investigations of the potential risks of models like Flamingo. This study is preliminary and we foresee that further research efforts should be undertaken to better assess those risks. We also discuss potential mitigation strategies towards safely deploying these models. Note that as explained in our Model Card in Appendix E, this model was developed for research purposes only and should not be used in specific applications before proper risk analyses are conducted and mitigation strategies are explored.
Recall that a large part of our model is obtained by freezing the weights of an existing language model. In particular, if provided with no images, Flamingo falls back to language model behavior. As such, Flamingo is exposed to the same risks as large language models: it can output potentially offensive language, propagate social biases and stereotypes, and leak private information. In particular, we refer to the analysis presented in the Chinchilla paper (Hoffmann et al., Section 4.2.7) in terms of gender bias on the Winogender dataset, which demonstrates that even though this model is less biased towards gender than previous models, gender biases are still present. In terms of unprompted toxicity, we also refer to the analysis from Chinchilla, which highlights that overall the propensity of the model to produce toxic outputs when not prompted to do so is rather low, as measured by computing the PerspectiveAPI toxicity score on 25,000 samples. Weidinger et al. detail possible long-term mitigation strategies for these risks. They include social or public policy interventions, such as the creation of regulatory frameworks and guidelines; careful product design, for instance relating to user interface decisions; and research at the intersection between AI Ethics and NLP, such as building better benchmarks and improving mitigation strategies. In the short term, effective approaches include relying on prompting to mitigate any biases and harmful outputs. Next, we explore the additional risks incurred by Flamingo’s additional visual input capabilities.
Previous work has studied biases that exist in captioning systems . Such modeling biases can result in real-world harms if deployed without care. For AI systems to be useful to society as a whole, their performance should not depend on the perceived skin tone or gender of the subjects – they should work equally well for all populations. However, current automatic vision system performance has been reported to vary with race, gender or when applied across different demographics and geographic regions . As a preliminary study assessing how Flamingo’s performance varies between populations, we follow the study proposed in Zhao et al. and report how the captioning performance of our model varies on COCO as a function of gender and race. Note that we use a different evaluation protocol from the one proposed by Zhao et al. ; in that work, they measure results across 5 pretrained models and compute confidence intervals across aggregated per-model scores. Here, we have just one copy of our model (due to its high training cost), and we instead perform statistical tests on the per-sample CIDEr scores across the splits from Zhao et al. . We report the results in Table 12.
Overall, when comparing the CIDEr scores aggregated among images labeled as female versus male, as well as when comparing darker skin versus lighter skin, we find there are no statistically significant differences in the per-sample CIDEr scores. To compare the two sets of samples, we use a two-tailed t-test with unequal variance, and among the four comparisons considered, the lowest p-value we find is , well above typical statistical significance thresholds (e.g. a common rejection threshold might be ). This implies that the differences in scores are indistinguishable from random variation under the null hypothesis that the mean scores are equal. We note that a failure to reject the null hypothesis and demonstrate a significant difference does not imply that there are no significant differences; it is possible that a difference exists that could be demonstrated with larger sample sizes, for example. However, these preliminary results are nonetheless encouraging.
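A two-tailed test with unequal variance of this kind (commonly known as Welch's test) can be sketched with the Python standard library alone. The snippet computes the test statistic and the Welch-Satterthwaite degrees of freedom exactly, but approximates the two-sided p-value with a standard normal distribution, which is only reasonable for large samples; an exact p-value would need the Student t distribution. The function name and the scores below are hypothetical and are not the paper's CIDEr data.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t_test(a, b):
    """Welch's two-sample test for equal means with unequal variances.

    Returns (t, df, p_approx), where p_approx is a two-sided p-value from
    a normal approximation (an assumption valid for large samples only).
    Raises ZeroDivisionError if both samples have zero variance.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    p_approx = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, df, p_approx

# Hypothetical per-sample scores for two groups.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t, df, p_approx = welch_t_test(group_a, group_b)
print(t, df, round(p_approx, 4))
```

For these toy groups the statistic is -1.0 with 8 degrees of freedom, and the approximate p-value is about 0.32, so the null hypothesis of equal means would not be rejected at common thresholds.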
We also use the Perspective API (https://perspectiveapi.com/) to evaluate the toxicity of the model’s generated captions when prompted with images from the COCO test set. We observe that some captions are labelled as potentially toxic by the classifier. However, when examining them manually, we do not observe any clear toxicity: the output captions are appropriate for the images provided. Overall, based on our own experiences interacting with the system throughout the course of the project, we have not observed toxic outputs when given “safe-for-work” imagery. However, this does not mean the model is incapable of producing toxic outputs, especially if probed with “not-safe-for-work” images and/or toxic text. A more thorough exploration and study would be needed if such a model were put in production.
Thanks to its ability to rapidly adapt in low-resource settings, Flamingo could itself be applied in addressing some of the issues described above. For instance, following Thoppilan et al. , adequately conditioned or fine-tuned Flamingo models could be used for filtering purposes of toxic or harmful samples in the training data. In their work, they observe significant improvements relating to safety and quality when fine-tuning on the resulting data. Furthermore, during evaluation, such adapted models could be used to down-rank or exclude outputs that might be classified as offensive, promoting social biases and stereotypes or leaking private information, thus accelerating progress in this direction even for low-resource tasks. Our results on the HatefulMemes benchmark represent a promising step in this direction. Recent work in the language modeling space has also shown success in training an LM to play the role of a “red team” and generate test cases, so as to automatically find cases where another target LM behaves in a harmful way . A similar approach could be derived for our setting. Enabling the model to support outputs with reference to particular locations within the visual inputs, or to external verified quotes is also an interesting direction . Finally, in Figure 11, we provide qualitative examples demonstrating that Flamingo can explain its own outputs, suggesting avenues to explainability and interpretability using the model’s text interface.
Appendix E Flamingo Model Card
We present a model card for Flamingo in the accompanying table, following the framework presented by Mitchell et al.
Appendix F Datasheets
We follow the framework defined by Gebru et al. and provide the datasheet for M3W in the accompanying table.
F.2 Image and video text pair datasets
F.2.2 Datasheet for VTP
Appendix G Credit for visual content
Row 1: All images are provided under license by Unsplash.
Row 2: All images are under the public domain.
Row 3: First two images are provided under license by Unsplash.
Row 6: First two are provided under license by Unsplash, the third one is provided by Wikimedia Commons, licensed under CC BY-ND 2.0.
Row 7: The images are provided by Wikimedia Commons, licensed under CC BY-ND 2.0.
Row 8: The images are provided by Wikimedia Commons, licensed under CC BY-ND 2.0.
Row 9: This video is from YFCC100M, licensed under CC BY-ND 2.0.
Dialogue 2: The first icon is provided under license by Flaticon, the second image is provided under license by Unsplash, the third one is provided under license by Sketchfab.
Dialogue 4: Chicago and Tokyo pictures obtained from Unsplash.
Model Figures 3, 7, 8 and 9: All images are provided under license by Unsplash.
Qualitative Figures 10, 11, 12, and 13: All visuals come from various sources including the COCO dataset, Wikimedia Commons (licensed under CC BY-ND 2.0), or DALL·E 2.