Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen
Introduction
Recent research efforts on conditional generative modeling, such as Imagen (Saharia et al., 2022), DALL-E 2 (Ramesh et al., 2022), and Parti (Yu et al., 2022), have advanced text-to-image generation to an unprecedented level, producing accurate, diverse, and even creative images from text prompts. These models leverage paired image-text data at Web scale (with hundreds of millions of training examples) and powerful backbone generative models, e.g., autoregressive models (Van Den Oord et al., 2017; Ramesh et al., 2021; Yu et al., 2022) and diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021), to generate highly realistic images. Studying these models’ generation results, we discovered that their outputs are surprisingly sensitive to the frequency of the entities (or objects) in the text prompts. In particular, when generating from text prompts about frequent entities, these models often generate realistic images, with faithful grounding to the entities’ visual appearance. However, when generating from text prompts with less frequent entities, those models either hallucinate non-existent entities or output related frequent entities (see Figure 1), failing to establish a connection between the generated image and the visual appearance of the mentioned entity. This key limitation can greatly harm the trustworthiness of text-to-image models in real-world applications and even raise ethical concerns. In our studies, we found that these models suffer from significant quality degradation when generating visual objects associated with under-represented groups.
In this paper, we propose a Retrieval-augmented Text-to-Image Generator (Re-Imagen), which alleviates such limitations by searching for entity information in a multi-modal knowledge base, rather than attempting to memorize the appearance of rare entities. Specifically, our multi-modal knowledge base encodes the visual appearances and descriptions of entities with a collection of reference
The backbone of Re-Imagen is a cascaded diffusion model (Ho et al., 2022), which contains three independent generation stages (implemented as U-Nets (Ronneberger et al., 2015)) to gradually produce high-resolution (i.e., 1024×1024) images. In particular, we train Re-Imagen on a dataset constructed from the image-text dataset used by Imagen (Saharia et al., 2022), where each data instance is associated with the top-k nearest neighbors within the dataset, based on text-only BM25 score. The retrieved top-k
To further quantitatively evaluate Re-Imagen, we present zero-shot text-to-image generation results on two challenging datasets: COCO (Lin et al., 2014) and WikiImages (Chang et al., 2022). (The original WikiImages database contains (entity image, entity description) pairs. It was crawled from Wikimedia Commons for visual question answering, and we repurpose it here for text-to-image generation.) Re-Imagen uses an external non-overlapping image-text database as the knowledge base for retrieval and then grounds on the retrieval to synthesize the target image. We show that Re-Imagen achieves state-of-the-art performance for text-to-image generation on COCO and WikiImages, measured in FID score (Heusel et al., 2017), among non-fine-tuned models. For the non-entity-centric dataset COCO, the performance gain comes from biasing the model to generate images with styles similar to the retrieved in-domain images. For the entity-centric dataset WikiImages, the performance gain comes from grounding the generation on retrieved images containing similar entities. We further evaluate Re-Imagen on a more challenging benchmark, EntityDrawBench, to test the model’s ability to generate a variety of infrequent entities (dogs, landmarks, foods, birds, animated characters) in different scenes. We compare Re-Imagen with Imagen (Saharia et al., 2022), DALL-E 2 (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2022) in terms of faithfulness and photorealism, as judged by human raters. We demonstrate that Re-Imagen can generate faithful and realistic images on 80% of input prompts, beating the existing best models by at least 30% on EntityDrawBench. Analysis shows that the improvements come mostly from low-frequency visual entities.
To summarize, our key contributions are: (1) a novel retrieval-augmented text-to-image model, Re-Imagen, which improves FID scores on two datasets; (2) interleaved classifier-free guidance during sampling to ensure both text alignment and entity fidelity; and (3) a new benchmark, EntityDrawBench, on which we show that Re-Imagen can significantly improve faithfulness on less-frequent entities.
Related Work
Text-to-Image Diffusion Models Diffusion models have seen widespread success (Ashual et al., 2022; Ramesh et al., 2022; Saharia et al., 2022; Nichol et al., 2021) in text-to-image generation, outperforming GANs (Goodfellow et al., 2014) and auto-regressive Transformers (Ramesh et al., 2021) in photorealism and diversity (under similar model size), without training instability and mode collapse issues. Among them, some recent large text-to-image models such as Imagen (Saharia et al., 2022), GLIDE (Nichol et al., 2021), and DALL-E 2 (Ramesh et al., 2022) have demonstrated excellent generation from complex prompt inputs. These models achieve highly fine-grained control over the generated images with text inputs. However, they do not perform explicit grounding over external visual knowledge and are restricted to memorizing the visual appearance of every possible visual entity in their parameters. This makes it difficult for them to generalize to rare or even unseen entities. In contrast, Re-Imagen is designed to free the diffusion model from memorizing, as models are encouraged to retrieve semantic neighbors from the knowledge base and use retrievals as context to paint the image. Re-Imagen improves the grounding of the diffusion models to real-world knowledge and is therefore capable of faithful image synthesis.
Concurrent Work There are several concurrent works (Li et al., 2022; Blattmann et al., 2022; Ashual et al., 2022) that also leverage retrieval to improve diffusion models. RDM (Blattmann et al., 2022) is trained similarly to Re-Imagen, using examples and near neighbors, but the neighbors in RDM are selected using image features, and at inference time retrievals are replaced with user-chosen exemplars. RDM was shown to effectively transfer artistic style from exemplars to generated images.
In contrast, our proposed Re-Imagen conditions on both text and multi-modal neighbors to generate the image, includes retrieval at inference time, and is demonstrated to improve performance on rare images (as well as more generally). KNN-Diffusion (Ashual et al., 2022) is more closely related to our work, as it also uses retrieval to improve the quality of generated images. However, KNN-Diffusion uses discrete image representations, while Re-Imagen uses the raw pixels, and Re-Imagen’s retrieved neighbors can be
Model
In this section, we start with background knowledge, in the form of a brief overview of the cascaded diffusion models used by Imagen. Next, we describe the concrete technical details of how we incorporate retrieval for Re-Imagen. Finally, we discuss interleaved guidance sampling.
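Diffusion models are trained to learn the image distribution by reversing the diffusion Markov chain. Theoretically, this reduces to learning to denoise $x_t \sim q(x_t \mid x_0)$ into $x_0$, with a time re-weighted square error loss (see Ho et al. (2020) for the complete proof):
$$\mathbb{E}_{x_0, \epsilon, t}\left[ w_t \cdot \lVert \hat{x}_\theta(x_t, c) - x_0 \rVert_2^2 \right]$$
Here, the noised image is denoted as $x_t := \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, $x_0$ is the ground-truth image, $c$ is the condition, $\epsilon \sim \mathcal{N}(0, I)$ is the noise term, $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$, where $\beta_t$ is the noise schedule. To simplify notation, we will allow the condition $c$ to include multiple conditioning signals, such as text prompts $c_p$, a low-resolution image input $c_x$ (which is used in super-resolution), or retrieved neighboring images $c_n$ (which are used in Re-Imagen). Imagen (Saharia et al., 2022) uses a U-Net (Ronneberger et al., 2015) to implement $\epsilon_\theta(x_t, c, t)$. The U-Net represents the reversed noise generator as follows:
$$\hat{x}_\theta(x_t, c) := \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, c, t) \right) / \sqrt{\bar{\alpha}_t}$$
During training, we randomly sample a time step $t$ and an image $x_0$ from the dataset $\mathcal{D}$, and minimize the difference between $\hat{x}_\theta(x_t, c)$ and $x_0$ according to Equation 2. At inference time, the diffusion model uses DDPM (Ho et al., 2020) to sample recursively as follows:
$$x_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, \hat{x}_\theta(x_t, c) + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \sqrt{\frac{(1 - \bar{\alpha}_{t-1})\, \beta_t}{1 - \bar{\alpha}_t}}\, \epsilon$$
The model sets $x_T$ as Gaussian noise, with $T$ denoting the total number of diffusion steps, and then keeps sampling in reverse until step $t = 0$, i.e., $x_T \to x_{T-1} \to \cdots \to x_0$, to reach the final image $\hat{x}_0$.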
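The noising, denoising, and reverse-sampling steps described above can be illustrated with a minimal sketch. It assumes a linear noise schedule and represents an image as a flat list of floats; the helper names (`make_schedule`, `add_noise`, `predict_x0`, `ddpm_step`) are hypothetical, and the noise prediction is passed in rather than produced by a U-Net.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and cumulative alpha products.

    Index 0 is a dummy entry (beta_0 = 0, alpha_bar_0 = 1) so that
    index t matches diffusion step t in 1..num_steps.
    """
    betas = [0.0] + [
        beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        for i in range(num_steps)
    ]
    alpha_bars = [1.0]
    for beta in betas[1:]:
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return betas, alpha_bars

def add_noise(x0, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def predict_x0(xt, eps_pred, alpha_bar_t):
    """Invert the forward process using a predicted noise vector."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(xt, eps_pred)]

def ddpm_step(xt, eps_pred, t, betas, alpha_bars, rng):
    """One reverse step x_t -> x_{t-1} using the DDPM posterior."""
    beta_t, abar_t, abar_prev = betas[t], alpha_bars[t], alpha_bars[t - 1]
    x0_hat = predict_x0(xt, eps_pred, abar_t)
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    sigma = math.sqrt((1.0 - abar_prev) * beta_t / (1.0 - abar_t))
    return [
        coef_x0 * x0h + coef_xt * x + sigma * rng.gauss(0.0, 1.0)
        for x0h, x in zip(x0_hat, xt)
    ]
```

At the last step (t = 1) the noise scale is zero and the weight on the predicted clean image is one, so the sampler returns the model's estimate of the clean image directly.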
For better generation efficiency, cascaded diffusion models (Ho et al., 2022; Ramesh et al., 2022; Saharia et al., 2022) use three separate diffusion models to generate high-resolution images gradually, going from low resolution to high resolution. The three models (a 64×64 model, a 256×256 super-resolution model, and a 1024×1024 super-resolution model) gradually increase the resolution to $1024 \times 1024$.
Classifier-free Guidance Ho & Salimans (2021) first proposed classifier-free guidance to trade off diversity and sample quality. This sampling strategy has been widely used due to its simplicity. In particular, Imagen (Saharia et al., 2022) adopts an adjusted $\epsilon$-prediction as follows:
$$\hat{\epsilon} = w \cdot \epsilon_\theta(x_t, c, t) - (w - 1) \cdot \epsilon_\theta(x_t, t)$$
where $w$ is the guidance weight. The unconditional $\epsilon$-prediction $\epsilon_\theta(x_t, t)$ is calculated by dropping the condition, i.e., the text prompt.
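As a small illustration of classifier-free guidance, the sketch below combines a conditional and an unconditional noise prediction with a guidance weight. The function name and the toy vectors are hypothetical stand-ins for the two U-Net outputs at a single step.

```python
def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Return w * eps_cond - (w - 1) * eps_uncond, element-wise."""
    return [w * c - (w - 1.0) * u for c, u in zip(eps_cond, eps_uncond)]

# Toy vectors standing in for the conditional and unconditional predictions.
eps_cond = [0.2, -0.4]
eps_uncond = [0.1, -0.1]
guided = classifier_free_guidance(eps_cond, eps_uncond, w=1.5)
```

With a weight of 1 the result is the conditional prediction itself; larger weights push the prediction further away from the unconditional one.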
Generating Image with Multi-Modal Knowledge
Similar to Imagen (Saharia et al., 2022), Re-Imagen is a cascaded diffusion model, consisting of 64×64, 256×256, and 1024×1024 diffusion models. However, Re-Imagen augments the diffusion model with the new capability of leveraging multimodal ‘knowledge’ from the external database, thus freeing the model from memorizing the appearance of rare entities. For brevity (and concreteness) we present below a high-level overview of the 64×64 model: the others are similar.
where $\hat{\epsilon}_p$ and $\hat{\epsilon}_n$ are the text-enhanced and neighbor-enhanced $\epsilon$-predictions, respectively. Here, $w_p$ is the text guidance weight and $w_n$ is the neighbor guidance weight. We then interleave the two guidance predictions by a certain predefined ratio $\eta$. Specifically, at each guidance step, we sample a $[0, 1]$-uniform random number $R$; if $R < \eta$, we use $\hat{\epsilon}_p$, and otherwise $\hat{\epsilon}_n$. We can adjust $\eta$ to balance the faithfulness w.r.t. the text description or the retrieved image-text pairs.
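The interleaving rule can be sketched as follows. This is a minimal illustration under an assumption: each enhanced prediction is taken to have the same form as classifier-free guidance, contrasting the jointly conditioned prediction with one where the corresponding condition is dropped. The function and argument names are hypothetical.

```python
import random

def guided(eps_full, eps_dropped, weight):
    """Classifier-free-guidance-style combination (assumed form)."""
    return [weight * f - (weight - 1.0) * d for f, d in zip(eps_full, eps_dropped)]

def interleaved_guidance(eps_joint, eps_text_only, eps_neighbor_only,
                         w_text, w_neighbor, eta, rng):
    """Use text-enhanced guidance with probability eta, else neighbor-enhanced.

    eps_joint:         prediction conditioned on text and neighbors
    eps_text_only:     prediction with the neighbors dropped
    eps_neighbor_only: prediction with the text prompt dropped
    """
    if rng.random() < eta:
        # Text-enhanced: contrast against the prediction without the text.
        return "text", guided(eps_joint, eps_neighbor_only, w_text)
    # Neighbor-enhanced: contrast against the prediction without neighbors.
    return "neighbor", guided(eps_joint, eps_text_only, w_neighbor)
```

Over many sampling steps, roughly a fraction `eta` of them use the text-enhanced prediction, so a smaller `eta` gives the retrieved neighbors more influence.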
Experiments
Re-Imagen consists of three submodels: a 2.5B 64×64 text-to-image model, a 750M 256×256 super-resolution model, and a 400M 1024×1024 super-resolution model. We also have a Re-Imagen-small with a 1.4B 64×64 text-to-image model to understand the impact of model size.
We finetune these models on the constructed KNN-ImageText dataset. We evaluate the model under two settings: (1) automatic evaluation on the COCO and WikiImages datasets, to measure the model’s general ability to generate photorealistic images, and (2) human evaluation on the newly introduced EntityDrawBench, to measure the model’s capability to generate long-tail entities.
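The KNN-ImageText construction, in which each instance is paired with its top-k neighbors by text-only BM25 score, can be sketched as below. This is an illustrative implementation of the standard Okapi BM25 formula over whitespace-tokenized captions; the function name, the tokenization, and the `k1` and `b` defaults are assumptions, not the settings used in the paper.

```python
import math
from collections import Counter

def bm25_top_k(captions, k=2, k1=1.5, b=0.75):
    """For each caption, return indices of its k best BM25 matches (self excluded)."""
    docs = [c.lower().split() for c in captions]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    term_counts = [Counter(d) for d in docs]

    def idf(term):
        n = doc_freq[term]
        return math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(query, j):
        total = 0.0
        for term in set(query):
            f = term_counts[j][term]
            if f:
                norm = f + k1 * (1.0 - b + b * len(docs[j]) / avg_len)
                total += idf(term) * f * (k1 + 1.0) / norm
        return total

    neighbors = []
    for i, query in enumerate(docs):
        ranked = sorted(
            (j for j in range(n_docs) if j != i),
            key=lambda j: score(query, j),
            reverse=True,
        )
        neighbors.append(ranked[:k])
    return neighbors
```

Excluding the query instance from its own neighbor list keeps the model from simply copying the target image during training.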
Training and Evaluation details The fine-tuning was run for 200K steps on 64 TPU-v4 chips and completed within two days. We use Adafactor for the 64×64 model and Adam for the 256×256 super-resolution model with a learning rate of 1e-4. We set the number of neighbors $k$=2 and use BM25 as the retrieval similarity during training. For the image-text database, we consider three different variants: (1) the in-domain training set, which contains non-overlapping small-scale in-domain image-text pairs from COCO or WikiImages, (2) the out-of-domain LAION dataset (Schuhmann et al., 2021) containing 400M
In these two experiments, we used the standard non-interleaved classifier-free guidance (Figure 2) with $T$=1000 steps for both the 64×64 diffusion model and the 256×256 super-resolution model. The guidance weight for the 64×64 model is swept over [1.0, 1.25, 1.5, 1.75, 2.0], while the 256×256 super-resolution model’s guidance weight is swept over [1.0, 5.0, 8.0, 10.0]. We select the guidance weight with the best FID score, which is reported in Table 1. We also demonstrate examples in Figure 4.
COCO Results COCO is the most widely used benchmark for text-to-image generation models. Although COCO does not contain many rare entities, it does contain unusual combinations of common entities, so it is plausible that retrieval augmentation could also help with some challenging text prompts. We adopt the FID score (Heusel et al., 2017) to measure image quality. Following the previous literature, we randomly sample 30K prompts from the validation set as input to the model. The generated images are compared with the reference images from the full validation set (42K). We list the results in two columns: FID-30K denotes the model with access to the in-domain COCO train set, while Zero-shot FID-30K denotes the model without access to any COCO data.
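For reference, FID compares Gaussian fits to Inception features of the reference and generated image sets. The standard definition (Heusel et al., 2017) is given below, where the symbols $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ (notation introduced here for illustration) denote the feature mean and covariance of the reference and generated images, respectively; lower is better.

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$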
Re-Imagen achieves a significant FID gain by retrieving from external databases: roughly a 2.0 absolute FID improvement over Imagen. Its performance is even better than that of the fine-tuned Make-A-Scene (Gafni et al., 2022). We found that Re-Imagen retrieving from the OOD database achieves less gain than with the IND database, but still obtains a 0.4 FID improvement over Imagen. Compared with other retrieval-augmented models, Re-Imagen outperforms KNN-Diffusion and Memory-Driven T2I models by a significant margin of 11 FID points. We also note that Re-Imagen-small is competitive in FID, outperforming the normal-sized Imagen with fewer parameters.
As COCO does not contain infrequent entities, retrievals from the in-domain database mainly provide useful ‘style knowledge’ for the model to ground on. Re-Imagen can better adapt to the COCO distribution, thus achieving a better FID score. As can be seen in the upper part of Figure 4, Re-Imagen with retrieval generates images of the same style as COCO, while without retrieval, the output is still high quality, but the style is less similar to COCO.
WikiImages Results WikiImages is constructed based on the multimodal corpus provided in WebQA (Chang et al., 2022), which consists of
From Table 1, we found that using the OOD database (LAION) actually achieves better performance than using the IND database. Unlike COCO, WikiImages contains mostly entity-focused images, so finding relevant entities in the database is more important than distilling styles from the training set. Since LAION-400M is 100x larger than the in-domain database, the chance of retrieving related entities is much higher, which leads to better performance. One example is depicted in the lower part of Figure 4, where the LAION retrieval finds ‘Island of San Giorgio Maggiore’, which helps the model generate the classical Renaissance-style church.
Entity Focused Evaluation on EntityDrawBench
Dataset Construction We introduce EntityDrawBench to evaluate the model’s capability to generate diverse sets of entities in different visual scenes. Specifically, we pick various types of visual entities (dog breeds, landmarks, foods, birds, and animated characters) from Wikimedia Commons, Google Landmarks and Fandom to construct our prompts. In total, we collect 250 entity-centric prompts for evaluation. Most of these prompts are unique: we could not find them on the Internet, let alone in the model training data. The dataset construction details are in Appendix C. To evaluate the model’s capability to ground on broader types of entities, we also randomly select 20 objects such as ‘sunglasses, backpack, vase, teapot, etc.’ and write creative prompts for them. We compare our generation results with the results from DreamBooth (Ruiz et al., 2022) in Appendix H.
We use the constructed prompt as the input and its corresponding image-text pairs as the ‘retrieval’ for Re-Imagen, to generate four 1024×1024 images. For the other models, we feed the prompts directly, also generating four images. We pick the best of the 4 random samples and have human raters rate its Photorealism and Faithfulness. For photorealism, we rate 1 if the image is moderately realistic without noticeable artifacts. For the faithfulness measure, we rate 1 if the image is faithful to both the entity appearance and the text description.
EntityDrawBench Results We use the proposed interleaved classifier-free guidance (subsection 3.2) for the 64×64 diffusion model, which runs for 256 diffusion steps under a strong guidance weight of 30 for both the text and neighbor conditions. For the 256×256 and 1024×1024 resolution models, we use constant guidance weights of 5.0 and 3.0, respectively, with 128 and 32 diffusion steps. The inference speed is 30-40 secs for 4 images on 4 TPU-v4 chips. We demonstrate our human evaluation results for faithfulness and photorealism in Table 2.
We can observe that Re-Imagen can in general achieve much higher faithfulness than the existing models while maintaining similar photorealism scores. When comparing with our backbone Imagen, we see the faithfulness score improves by around 40%, which indicates that our model is paying attention to the retrieved knowledge and assimilating it into the generation process.
We further partition the entities into ‘frequent’ and ‘infrequent’ categories based on their frequency (top 50% as ‘frequent’). We plot the faithfulness score for ‘frequent’ and ‘infrequent’ separately in Figure 5. We can see that Re-Imagen is less sensitive to the frequency of the input entities than the other models with only minor performance drops. This study reflects the effectiveness of text-to-image generation models on long-tail entities. More generation examples are shown in Appendix F.
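The frequency split above can be sketched as a small helper. It assumes each rated entity comes with a frequency count and a binary faithfulness rating; the record layout and the function name are hypothetical.

```python
def faithfulness_by_frequency(records):
    """Average binary faithfulness for the top-50% and bottom-50% entities.

    records: list of (entity, frequency, faithful) tuples, faithful in {0, 1}.
    """
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    half = len(ranked) // 2
    groups = {"frequent": ranked[:half], "infrequent": ranked[half:]}
    return {
        name: sum(r[2] for r in rows) / len(rows)
        for name, rows in groups.items()
        if rows
    }
```

A model that is insensitive to entity frequency would show similar averages for the two groups.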
Analysis
Comparison to Other Models We demonstrate some examples from different models in Figure 6. As can be seen, the images generated from Re-Imagen strike a good balance between text alignment and entity fidelity. Unlike image editing, which performs in-place modification, Re-Imagen can transform the neighbor entities both geometrically and semantically according to the text guidance. As a concrete example, Re-Imagen generates the Braque Saint-Germain (2nd row in Figure 6) on the grass, in a different viewpoint from the reference image.
Impact of Number of Retrievals The number of retrievals is an important factor for Re-Imagen. We vary the number of retrievals $K$ for all three datasets to understand its impact on the model performance. From Figure 7, we found that on COCO and WikiImages, increasing $K$ from 1 to 4 does not lead to much change in the FID score. However, on EntityDrawBench, increasing $K$ dramatically improves the faithfulness of the generated images. This indicates the importance of having multiple images to help Re-Imagen ground on the visual entity. We provide visual examples in Appendix D.
Text and Entity Faithfulness Trade-offs In our experiments, we found that there is a trade-off between faithfulness to the text prompt and faithfulness to the retrieved entity images. Based on Equation 6, by adjusting $\eta$, i.e., the proportion of $\hat{\epsilon}_p$ and $\hat{\epsilon}_n$ in the sampling schedule, we can control Re-Imagen so as to generate images that explore this trade-off: decreasing $\eta$ will increase the entity faithfulness but decrease the text alignment. We found that having $\eta$ around 0.5 is usually a ‘sweet spot’ that balances both conditions.
Conclusions
We present Re-Imagen, a retrieval-augmented diffusion model, and demonstrate its effectiveness in generating realistic and faithful images. We exhibit such advantages not only through automatic FID measures on standard benchmarks (i.e., COCO and WikiImages) but also through human evaluation of the newly introduced EntityDrawBench. We further demonstrate that our model is particularly effective in generating an image from text that mentions rare entities.
Re-Imagen still suffers from well-known issues in text-to-image generation, which we review below in the Ethics Statement. In addition, Re-Imagen has some unique limitations due to the retrieval-augmented modeling. First, Re-Imagen is sensitive to the retrieved image-text pairs it is conditioned on: when a retrieved image is of low quality, the generated image is negatively affected. Second, Re-Imagen sometimes still fails to generate high-quality images from highly compositional prompts, where multiple entities are involved. Third, the super-resolution model is still not competent at capturing low-level details of retrieved entities, leading to visual distortion. In future work, we plan to further investigate the above limitations and address them.
Ethics Statement
Strong text-to-image generation models, e.g., Imagen (Saharia et al., 2022) and Parti (Yu et al., 2022), raise ethical challenges along dimensions such as social bias. Re-Imagen is exposed to the same challenges, as we employed Web-scale datasets similar to those used by the prior models.
The retrieval-augmented modeling techniques of Re-Imagen have substantially improved the controllability and attribution of the generated image. Like many basic research topics, this additional control could be used for beneficial or harmful purposes. One obvious danger is that Re-Imagen (or similar models) could be used for malicious purposes like spreading misinformation, e.g., by producing realistic images of specific people in misleading visual contexts. On the other side, additional control has many potential benefits. One general benefit is that Re-Imagen can reduce hallucination and increase the faithfulness of the generated image to the user’s intent. Another benefit is that the ability to work with tail entities makes the model more useful for minorities and other users in smaller communities: for example, Re-Imagen is more effective at generating images of landmarks famous in smaller communities or cultures and generating images of indigenous foods and cultural artifacts. We argue that this model can help decrease the frequency-caused bias in current neural network-based AI systems.
Considering such potential threats to the public, we will be cautious about code and API release. In future work, we will explore a framework for responsible use that balances the value of external auditing of research with the risks of unrestricted open access, allowing this work to be used in a safe and beneficial way.
References
Appendix A Extended literature review
As mentioned in the main text, this section provides an additional review of related work on (1) Retrieval-augmented Generative Models and (2) Text-guided Image Editing.
Knowledge grounding has also drawn significant attention in the natural language processing (NLP) community. Different semi-parametric models like KNN-LM (Khandelwal et al., 2019), RAG (Lewis et al., 2020), REALM (Guu et al., 2020), and RETRO (Borgeaud et al., 2021) have been proposed to incorporate external textual knowledge into transformer language models. These models have demonstrated great advantages in increasing the language model’s faithfulness and reducing the computation/memory cost. Such attempts have also been made in visual tasks like image recognition (Long et al., 2022), 2-D scene reconstruction (Siddiqui et al., 2021), and image inpainting (Xu et al., 2021). Our proposed method follows the same theme, incorporating visual knowledge into a pre-trained text-to-image generation model to help the model generalize to long-tail or even unseen entities without scaling up the parameters.
Text-Guided Image Editing
Text-guided image editing aims at preserving the object’s appearance while changing certain contexts in the image. Previously, GANs (Goodfellow et al., 2014) have been used to achieve significant performance on image editing (Zhu et al., 2016; Abdal et al., 2019; Zhu et al., 2020; Roich et al., 2021; Tov et al., 2021; Wang et al., 2022; Alaluf et al., 2022). The problem is also known as inversion, as it normally requires finding the initial noise vector added in the generation process. More recently, Prompt-to-Prompt (Hertz et al., 2022) proposes to use pre-trained text-image models for image editing. Image editing is focused on performing in-place modifications to the input image, either changing the global styles or editing a local region specifically without modifying the object’s appearance. In contrast, we treat the retrieved image as ‘knowledge’ and ground on it to synthesize new images. Thus, we are not restricted to in-place modifications and are able to perform more sophisticated transformations over the objects.
Appendix B WikiImages Dataset
The WikiImages dataset is taken from WebQA (Chang et al., 2022). Images were crawled from Wikimedia Commons via the Bing Visual Search API. Since many of Wikimedia’s topics are not visually interesting, the authors seeded the search with natural scenes and gradually refined the search pool to obtain more interesting images. The images mostly contain entities from Wikipedia or WikiData. However, the original dataset is still very noisy. Therefore, we apply further filtering to obtain the pairs most plausible for image generation. Specifically, we remove all image-text pairs whose text is longer than 15 tokens and all pairs whose text contains a date or wiki-id information.
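The filtering rule can be sketched as follows. The exact tokenizer and the patterns used to detect dates and wiki-ids are not specified here, so this sketch makes loud assumptions: whitespace tokens, a date detected as a four-digit year or a numeric `d/m/y` string, and a wiki-id detected as a Wikidata-style `Q` number. All names are hypothetical.

```python
import re

# Assumed patterns: a four-digit year (1000-2099) or a numeric date like 12/05/2019.
DATE_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")
# Assumed pattern: Wikidata-style identifiers such as Q12345.
WIKI_ID_PATTERN = re.compile(r"\bQ\d+\b")

def keep_pair(text, max_tokens=15):
    """Return True if the caption passes the length, date, and wiki-id filters."""
    if len(text.split()) > max_tokens:
        return False
    if DATE_PATTERN.search(text) or WIKI_ID_PATTERN.search(text):
        return False
    return True
```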
Appendix C EntityDrawBench
For dog breeds and birds, we sample 50 from Wikimedia Commons (https://commons.wikimedia.org/wiki/List_of_dog_breeds) as our candidates. For landmarks, we sample 50 from Google Landmarks (Weyand et al., 2020) as our candidates. For foods, we sample 50 from Wikipedia (https://en.wikipedia.org/wiki/List_of_cuisines) as our candidates. For film characters, we collect 50 Star Wars images from Fandom (https://starwars.fandom.com/wiki/Main_Page). We use appropriately paired source images as the retrieved ‘knowledge’. For each entity category, we write 5 prompt templates with an entity name placeholder, each describing the entity in a different scene. For each entity, we sample a template and replace the placeholder with the entity’s name to generate a prompt, which is used as input to the text-to-image generation model.
We list all the prompt templates in Figure 10.
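The template-filling step can be sketched in a few lines. The placeholder token `<entity>`, the function name, and any templates passed in are hypothetical; the actual templates are those listed in Figure 10.

```python
import random

def build_prompts(entities, templates, rng, placeholder="<entity>"):
    """Sample one template per entity and substitute the entity name."""
    return [rng.choice(templates).replace(placeholder, name) for name in entities]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled prompts reproducible.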
Appendix D Impact of Retrieval Number K
We change the retrieval number K from 1 to 2 to see its impact on the model output. We show some examples in Figure 11 to demonstrate the advantage of having multiple retrievals to help the model better capture the visual appearance of the given entities.
Appendix E Classifier-free guidance sampling strategy
We demonstrate different types of sampling strategies that leverage the two conditions: standard joint condition guidance sampling, weighted guidance sampling, and our proposed interleaved guidance sampling.
The standard joint condition guidance only considers the joint diffusion score to meet both conditions. In contrast, weighted guidance sampling uses the weighted sum of the text-enhanced epsilon $\hat{\epsilon}_p$ and the neighbor-enhanced epsilon $\hat{\epsilon}_n$. Our interleaved classifier-free guidance switches between $\hat{\epsilon}_p$ and $\hat{\epsilon}_n$, with a ratio of $\eta$. We plot their conceptual difference in Figure 12. Essentially, $\hat{\epsilon}_p$ and $\hat{\epsilon}_n$ do not depend on each other in weighted sampling, whereas they are dependent in interleaved sampling. In an extreme case where $\hat{\epsilon}_p$ and $\hat{\epsilon}_n$ are contradictory to each other, the weighted model will get stuck in a local region. In contrast, interleaved sampling can alleviate this issue.
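For contrast with interleaving, weighted guidance can be sketched as a per-step blend of the two enhanced predictions. The exact weighting is not given here, so the convex combination and the mixing weight `lam` below are assumptions for illustration only.

```python
def weighted_guidance(eps_text, eps_neighbor, lam=0.5):
    """Blend the text-enhanced and neighbor-enhanced predictions at one step.

    lam is a hypothetical mixing weight: 1.0 uses only the text-enhanced
    prediction and 0.0 uses only the neighbor-enhanced one.
    """
    return [lam * p + (1.0 - lam) * n for p, n in zip(eps_text, eps_neighbor)]
```

When the two predictions point in opposite directions, the blend largely cancels out, which is one way to picture the stuck case described above; interleaving instead commits to one prediction per step.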
We compare 20 dog images generated with these three sampling strategies on EntityDrawBench. We vary the number of diffusion steps to observe the human evaluation score curves and showcase some generated outputs in Figure 13. As can be seen, joint decoding is dominated either by the retrieved image or by the text prompt. Weighted and interleaved sampling help balance the two conditions to generate better images. We also found that with fewer sampling steps (K=200), “weighted” sampling actually achieves better results than “interleaved” sampling. However, as the number of sampling steps increases, our proposed “interleaved” sampling achieves a better human evaluation score.
Appendix F Generation Examples
We provide more generation examples in Figure 14 and Figure 15.
Appendix G Imaginary Examples
We provide generation results for imaginary scenes in Figure 16.
Appendix H Comparison with DreamBooth
We also add a comparison to DreamBooth (Ruiz et al., 2022). We adopt almost the same input images as DreamBooth and display our generation results in Figure 17, Figure 18 and Figure 19.
Appendix I Failure Examples
We found that Re-Imagen can also fail in many cases. We demonstrate a few examples in Figure 20. As can be seen, the model has several failure modes: (1) the text input prior is too strong, e.g., ‘Zoom’ is interpreted by the model as a ‘Zoom-in’ picture; (2) the model cannot ground the retrieval text on the retrieval image, e.g., the model believes that only the ‘beef tenderloin inside the bowl’ is ‘Escudella’ rather than the whole stew, and therefore generates ‘beef tenderloin on the grass’; (3) the model sometimes mixes up the two conditions, e.g., the reference ‘Australian Pinscher’ and the ‘rabbit’ in the prompt get mixed into a single object.