Muse: Text-To-Image Generation via Masked Generative Transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

Introduction

Generative image models conditioned on text prompts have taken an enormous leap in quality and flexibility in the last few years (Ramesh et al., 2022; Nichol et al., 2021; Saharia et al., 2022; Yu et al., 2022; Rombach et al., 2022; Midjourney, 2022). This was enabled by a combination of deep learning architecture innovations (Van Den Oord et al., 2017; Vaswani et al., 2017); novel training paradigms such as masked modeling for both language (Devlin et al., 2018; Raffel et al., 2020) and vision tasks (He et al., 2022; Chang et al., 2022); new families of generative models such as diffusion (Ho et al., 2020; Rombach et al., 2022; Saharia et al., 2022) and masking-based generation (Chang et al., 2022); and finally, the availability of large scale image-text paired datasets (Schuhmann et al., 2021).

In this work, we present a new model for text-to-image synthesis using a masked image modeling approach (Chang et al., 2022). Our image decoder architecture is conditioned on embeddings from a pre-trained and frozen T5-XXL (Raffel et al., 2020) large language model (LLM) encoder. In agreement with Imagen (Saharia et al., 2022), we find that conditioning on a pre-trained LLM is crucial for photorealistic, high quality image generation. Our models (except for the VQGAN quantizer) are built on the Transformer (Vaswani et al., 2017) architecture.

We have trained a sequence of Muse models, ranging in size from 632M parameters to 3B parameters (for the image decoder; the T5-XXL model has an additional 4.6B parameters). Each model consists of several sub-models (Figure 3). First, we have a pair of VQGAN “tokenizer” models (Esser et al., 2021b), which can encode an input image to a sequence of discrete tokens as well as decode a token sequence back to an image. We use two VQGANs, one for 256x256 resolution (“low-res”) and another for 512x512 resolution (“high-res”). Second, we have a base masked image model, which contains the bulk of our parameters. This model takes a sequence of partially masked low-res tokens and predicts the marginal distribution for each masked token, conditioned on the unmasked tokens and a T5-XXL text embedding. Third, we have a “superres” transformer model which translates (unmasked) low-res tokens into high-res tokens, again conditioned on T5-XXL text embeddings. We explain our pipeline in detail in Section 2.

Compared to Imagen (Saharia et al., 2022) or Dall-E2 (Ramesh et al., 2022), which are built on cascaded pixel-space diffusion models, Muse is significantly more efficient due to the use of discrete tokens; it can be thought of as a discrete diffusion process with the absorbing state ([MASK]) (Austin et al., 2021). Compared to Parti (Yu et al., 2022), a state-of-the-art autoregressive model, Muse is more efficient due to the use of parallel decoding. Based on comparisons on similar hardware (TPU-v4 chips), we estimate that Muse is more than 10x faster at inference time than either Imagen-3B or Parti-3B models and 3x faster than Stable Diffusion v1.4 (Rombach et al., 2022) (see Section 3.2.2). All these comparisons are for images of the same size: either $256\times 256$ or $512\times 512$. Muse is also faster than Stable Diffusion (Rombach et al., 2022), in spite of both models working in the latent space of a VQGAN. We believe that this is due to the use of a diffusion model in Stable Diffusion v1.4, which requires a significantly higher number of iterations at inference time.

The efficiency improvement of Muse, however, does not come at a loss of generated image quality or semantic understanding of the input text prompt. We evaluate our output on multiple criteria, including CLIP score (Radford et al., 2021) and FID (Heusel et al., 2017). The former is a measure of image-text correspondence, and the latter a measure of image quality and diversity. Our 3B parameter model achieves a CLIP score of 0.32 and an FID score of 7.88 on the COCO (Lin et al., 2014) zero-shot validation benchmark, which compares favorably with that of other large-scale text-to-image models (see Table 2). Our 632M (base) + 268M (super-res) parameter model achieves a state of the art FID score of 6.06 when trained and evaluated on the CC3M (Sharma et al., 2018) dataset, which is significantly lower than all other reported results in the literature (see Table 1). We also evaluate our generations on the PartiPrompts (Yu et al., 2022) evaluation suite with human raters, who find that Muse generates images better aligned with its text prompt 2.7x more often than Stable Diffusion v1.4 (Rombach et al., 2022).

Muse generates images that reflect different parts of speech in input captions, including nouns, verbs and adjectives. Furthermore, we present evidence of multi-object properties understanding, such as compositionality and cardinality, as well as image style understanding. See Figure 1 for a number of these examples and our website http://muse-model.github.io for more examples. The mask-based training of Muse lends itself to a number of zero-shot image editing capabilities. A number of these are shown in Figure 2, including zero-shot, text-guided inpainting, outpainting and mask-free editing. More details are in Section 3. Our contributions are:

We present a state-of-the-art model for text-to-image generation which achieves excellent FID and CLIP scores (quantitative measures of image generation quality, diversity and alignment with text prompts).

Our model is significantly faster than comparable models due to the use of quantized image tokens and parallel decoding.

Our architecture enables out-of-the-box, zero-shot editing capabilities including inpainting, outpainting, and mask-free editing.

Model

Our model is built on a number of components. Here, we provide an overview of each of those components in the order of their training, while relegating many details of the architecture and parameters to the Appendix. Figure 3 provides an overview of the model architecture.

Similar to the findings in (Saharia et al., 2022), we find that leveraging a pre-trained large language model (LLM) is beneficial to high-quality image generation. The embeddings extracted from an LLM such as T5-XXL (Raffel et al., 2020) carry rich information about objects (nouns), actions (verbs), visual properties (adjectives), spatial relationships (prepositions), and other properties such as cardinality and composition. Our hypothesis is that the Muse model learns to map these rich visual and semantic concepts in the LLM embeddings to the generated images; it has been shown in recent work (Merullo et al., 2022) that the conceptual representations learned by LLMs are roughly linearly mappable to those learned by models trained on vision tasks. Given an input text caption, we pass it through the frozen T5-XXL encoder, resulting in a sequence of 4096-dimensional language embedding vectors. These embedding vectors are linearly projected to the hidden size of our Transformer models (base and super-res).

2.2 Semantic Tokenization using VQGAN

A core component of our model is the use of semantic tokens obtained from a VQGAN (Esser et al., 2021b) model. This model consists of an encoder and a decoder, with a quantization layer that maps an input image into a sequence of tokens from a learned codebook. We build our encoder and decoder entirely with convolutional layers to support encoding images of different resolutions. The encoder has several downsampling blocks to reduce the spatial dimension of the input, while the decoder has the corresponding number of upsampling blocks to map the latents back to the original image size. Given an image of size $H\times W$, the encoded token map is of size $H/f\times W/f$, with downsampling ratio $f$. We train two VQGAN models: one with downsampling ratio $f=16$ and the other with downsampling ratio $f=8$. We obtain tokens for our base model using the $f=16$ VQGAN model on $256\times 256$ pixel images, resulting in tokens with spatial size $16\times 16$. We obtain the tokens for our super-resolution model using the $f=8$ VQGAN model on $512\times 512$ images, and the corresponding token map has spatial size $64\times 64$. As mentioned in previous work (Esser et al., 2021b), the resulting discrete tokens after encoding capture higher-level semantics of the image, while ignoring low-level noise. Furthermore, the discrete nature of these tokens allows us to use a cross-entropy loss at the output to predict masked tokens in the next stage.
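The token-grid arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `token_grid` is hypothetical and is not part of any released Muse code.

```python
def token_grid(height: int, width: int, f: int) -> tuple[int, int]:
    """Spatial size of the token map for an H x W image at downsampling ratio f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by the downsampling ratio")
    return height // f, width // f


# Base tokenizer: f = 16 on 256 x 256 images.
base_grid = token_grid(256, 256, 16)
# Super-resolution tokenizer: f = 8 on 512 x 512 images.
superres_grid = token_grid(512, 512, 8)

# Sequence lengths seen by the two transformers.
base_tokens = base_grid[0] * base_grid[1]
superres_tokens = superres_grid[0] * superres_grid[1]
```

The resulting sequence lengths (256 and 4096 tokens) are the ones quoted later for the decoding step counts.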

2.3 Base Model

Our base model is a masked transformer (Vaswani et al., 2017; Devlin et al., 2018), where the inputs are the projected T5 embeddings and image tokens. We leave all the text embeddings unmasked and randomly mask a varying fraction of image tokens (see Section 2.6), replacing them with a special [MASK] token (Chang et al., 2022). We then linearly map image tokens into image input embeddings of the required Transformer input/hidden size along with learned 2D positional embeddings. Following the Transformer architecture (Vaswani et al., 2017), we use several transformer layers, each including a self-attention block, a cross-attention block and an MLP block, to extract features. At the output layer, an MLP is used to convert each masked image embedding to a set of logits (corresponding to the VQGAN codebook size) and a cross-entropy loss is applied with the ground truth token label as the target. During training, the base model is trained to predict all masked tokens at each step. However, for inference, mask prediction is performed in an iterative manner, which significantly increases quality. See Section 2.8 for details.
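As a minimal sketch of the masked-token objective described above, the following pure-Python function averages the cross-entropy over masked positions only. The function name, the list-based data layout and the choice to average (rather than sum) are assumptions made for illustration; the real model operates on batched tensors.

```python
import math


def masked_cross_entropy(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    logits:    one list of codebook logits per image token
    targets:   ground-truth codebook index per image token
    is_masked: True where the input token was replaced by [MASK]
    """
    total, count = 0.0, 0
    for row, target, masked in zip(logits, targets, is_masked):
        if not masked:
            continue  # unmasked tokens do not contribute to the loss
        peak = max(row)
        # Numerically stable log-sum-exp of the logits.
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Toy example with a codebook of size 4: only the first token is masked.
toy_loss = masked_cross_entropy(
    logits=[[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]],
    targets=[1, 0],
    is_masked=[True, False],
)
```

With uniform logits over four codes, the loss at the single masked position equals log 4, which gives a quick sanity check.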

2.4 Super-Resolution Model

We found that directly predicting $512\times 512$ resolution leads the model to focus on low-level details over large-scale semantics. As a result, we found it beneficial to use a cascade of models: first a base model that generates a $16\times 16$ latent map (corresponding to a $256\times 256$ image), followed by a super-resolution model that upsamples the base latent map to a $64\times 64$ latent map (corresponding to a $512\times 512$ image). The super-res model is trained after the base model has been trained.

As mentioned in Section 2.2, we trained two VQGAN models, one at $16\times 16$ latent resolution and $256\times 256$ spatial resolution, and the second at $64\times 64$ latent resolution and $512\times 512$ spatial resolution. Since our base model outputs tokens corresponding to a $16\times 16$ latent map, our super-resolution procedure learns to “translate” the lower-resolution latent map to the higher-resolution latent map, followed by decoding through the higher-resolution VQGAN to give the final high-resolution image. This latent map translation model is also trained with text conditioning and cross-attention in an analogous manner to the base model, as shown in Figure 4.

2.5 Decoder Finetuning

To further improve our model’s ability to generate fine details, we increase the capacity of the VQGAN decoder by adding more residual layers and channels while keeping the encoder capacity fixed. We then finetune the new decoder layers while keeping the VQGAN encoder weights, codebook and transformers (i.e., base model and super-resolution model) frozen. This allows us to improve our visual quality without re-training any of the other model components (because the visual token “language” stays fixed). This is shown in Figure 13 in the Appendix, where we see that the finetuned decoder can reconstruct sharper details in the store front. We also give details of the finetuned decoder architecture in the Appendix.

2.6 Variable Masking Rate

As was done in (Chang et al., 2022), we train our model with a variable masking rate based on a cosine scheduling: for each training example, we sample a masking rate $r\in[0,1]$ from a truncated $\arccos$ distribution with density function $p(r)=\frac{2}{\pi}(1-r^{2})^{-\frac{1}{2}}$. This has an expected masking rate of 0.64, with a strong bias towards higher masking rates. The bias towards higher masking rates makes the prediction problem harder. In contrast with autoregressive approaches, which learn conditional distributions $P(x_{i}|x_{<i})$ for some fixed ordering of tokens, random masking with a variable masking ratio allows our models to learn $P(x_{i}|x_{\Lambda})$ for arbitrary subsets of tokens $\Lambda$. This is not only critical for our parallel sampling scheme, but it also enables a number of zero-shot, out-of-the-box editing capabilities, such as shown in Figure 2 and Section 3.3.
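One way to draw from this density is inverse-transform sampling: if $u$ is uniform on $[0,1)$, then $r=\cos(\pi u/2)$ has density $\frac{2}{\pi}(1-r^{2})^{-\frac{1}{2}}$ on $(0,1]$, and the mean is $\int_0^1 r\,p(r)\,dr = 2/\pi \approx 0.64$, matching the expected masking rate quoted above. The sketch below assumes this sampling route (the text does not say how the rate is actually drawn), and the function name is hypothetical.

```python
import math
import random


def sample_mask_rate(rng: random.Random) -> float:
    """Draw a masking rate r in (0, 1] with density (2/pi) * (1 - r^2)^(-1/2)."""
    u = rng.random()  # uniform on [0, 1)
    return math.cos(0.5 * math.pi * u)


rng = random.Random(0)
rates = [sample_mask_rate(rng) for _ in range(100_000)]

# Empirical mean; the analytic value is 2 / pi.
mean_rate = sum(rates) / len(rates)
# Fraction of training examples with more than half of the tokens masked.
frac_above_half = sum(r > 0.5 for r in rates) / len(rates)
```

Under this density, the probability of masking more than half of the tokens is $\frac{2}{\pi}\arccos(0.5)=2/3$, which illustrates the bias towards high masking rates.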

2.7 Classifier Free Guidance

Intuitively, classifier-free guidance (CFG) trades off diversity for fidelity. Different from previous approaches, we reduce the hit to diversity by linearly increasing the guidance scale $t$ through the sampling procedure. This allows the early tokens to be sampled more freely, with low or no guidance, but increases the influence of the conditioning prompt for the later tokens.
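A linearly increasing guidance scale can be sketched as follows. Several details here are assumptions rather than facts from this section: the commonly used guided-logit form $(1+t)\,\ell_c - t\,\ell_u$ (conditional logits $\ell_c$, unconditional logits $\ell_u$), the schedule endpoints, the example values (24 steps, final scale 4), and both function names.

```python
def guidance_scale(step: int, num_steps: int, t_max: float, t_min: float = 0.0) -> float:
    """Guidance scale that rises linearly from t_min (first step) to t_max (last step)."""
    if num_steps < 2:
        return t_max
    frac = step / (num_steps - 1)
    return t_min + (t_max - t_min) * frac


def guided_logits(cond, uncond, t):
    """Assumed CFG combination: (1 + t) * conditional - t * unconditional."""
    return [(1.0 + t) * c - t * u for c, u in zip(cond, uncond)]


# Illustrative schedule over 24 sampling steps with a final scale of 4.
schedule = [guidance_scale(k, 24, t_max=4.0) for k in range(24)]
```

With $t=0$ the guided logits reduce to the conditional logits, so the earliest tokens are sampled without extra guidance, consistent with the description above.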

2.8 Iterative Parallel Decoding at Inference

The critical component for our model’s inference time efficiency is the use of parallel decoding to predict multiple output tokens in a single forward pass. The key assumption underlying the effectiveness of the parallel decoding is a Markovian property that many tokens are conditionally independent given other tokens. Decoding is performed based on a cosine schedule (Chang et al., 2022) that chooses a certain fixed fraction of the highest confidence masked tokens that are to be predicted at that step. These tokens are then set to unmasked for the remainder of the steps and the set of masked tokens is appropriately reduced. Using this procedure, we are able to perform inference of 256 tokens using only 24 decoding steps in our base model and 4096 tokens using 8 decoding steps in our super-resolution model, as compared to the 256 or 4096 steps required for autoregressive models (e.g., (Yu et al., 2022)) and hundreds of steps for diffusion models (e.g., (Rombach et al., 2022; Saharia et al., 2022)). We note that recent methods including progressive distillation (Salimans & Ho, 2022) and better ODE solvers (Lu et al., 2022) have greatly reduced the sampling steps of diffusion models, but they have not been widely validated in large scale text-to-image generation. We leave the comparison to these faster methods to future work, while noting that similar distillation approaches are also a possibility for our model.
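The decoding loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the exact schedule formula (the number of tokens left masked after step $k$ of $K$ is $\lfloor N\cos(\frac{\pi}{2}\cdot\frac{k+1}{K})\rfloor$) follows the usual cosine schedule of masked generative transformers but is not spelled out in this section, and `toy_predict` is a random stand-in for the transformer, not the real model.

```python
import math
import random

MASK = -1  # placeholder id for the [MASK] token


def num_masked_after(step: int, num_steps: int, num_tokens: int) -> int:
    """Assumed cosine schedule: tokens still masked after 0-based `step`."""
    ratio = math.cos(0.5 * math.pi * (step + 1) / num_steps)
    return max(0, int(math.floor(num_tokens * ratio)))


def toy_predict(tokens, rng, codebook_size):
    """Stand-in for the transformer: one (token id, confidence) guess per position."""
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]


def parallel_decode(num_tokens=256, num_steps=24, codebook_size=8192, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(num_steps):
        preds = toy_predict(tokens, rng, codebook_size)  # one forward pass
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        keep_masked = num_masked_after(step, num_steps, num_tokens)
        num_to_fix = max(0, len(masked) - keep_masked)
        # Commit the highest-confidence predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fix]:
            tokens[i] = preds[i][0]
    return tokens


base_out = parallel_decode(num_tokens=256, num_steps=24)
superres_out = parallel_decode(num_tokens=4096, num_steps=8)
```

In this sketch the full pipeline uses 24 + 8 = 32 forward passes, versus 256 + 4096 sequential steps for token-by-token autoregressive decoding of the same grids.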

Results

We train a number of base Transformer models at different parameter sizes, ranging from 600M to 3B parameters. Each of these models is fed the output embeddings from a T5-XXL model, which is pre-trained and frozen and consists of 4.6B parameters. Our largest base model of 3B parameters consists of 48 Transformer layers with cross-attention from text to image and self-attention among image tokens. All base models share the same image tokenizer. We use a CNN model with 19 ResNet blocks and a quantized codebook of size 8192 for the tokenization. Larger codebook sizes did not result in performance improvements. The super-resolution model consists of 32 multi-axis Transformer layers (Zhao et al., 2021) with cross-attention from concatenated text and image embedding to high resolution image and self-attention among high resolution image tokens. This model converts a sequence of tokens from one latent space to another: from the latent space of the base model tokenizer, with $16\times 16$ tokens, to that of a higher resolution tokenizer, with $64\times 64$ tokens. After token conversion, the decoder for the higher resolution tokenizer is used to convert to the higher resolution image space. Further details of configurations are provided in the appendix.

We train on the Imagen dataset consisting of 460M text-image pairs (Saharia et al., 2022). Training is performed for 1M steps, with a batch size of 512 on 512-core TPU-v4 chips (Jouppi et al., 2020). This takes about 1 week of training time. We use the Adafactor optimizer (Shazeer & Stern, 2018) to save on memory consumption which allowed us to fit a 3B parameter model without model parallelization. We also avoid performing exponential moving averaging (EMA) of model weights during training, again to save on TPU memory. In order to reap the benefits of EMA, we checkpoint every 5000 steps, then perform EMA offline on the checkpointed weights with a decay factor of 0.7. These averaged weights form the final base model weights.
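The offline averaging step can be sketched as below. The update rule (the standard exponential moving average, `avg = decay * avg + (1 - decay) * checkpoint`) and the initialization from the first checkpoint are assumptions, since the text only gives the checkpoint interval and the decay factor; the dictionary-of-lists layout and the function name are purely illustrative.

```python
def offline_ema(checkpoints, decay=0.7):
    """Exponential moving average over saved checkpoints, oldest first.

    checkpoints: list of dicts mapping parameter name -> list of floats
    """
    # Assumed initialization: start from the oldest checkpoint.
    averaged = {name: list(values) for name, values in checkpoints[0].items()}
    for ckpt in checkpoints[1:]:
        for name, values in ckpt.items():
            averaged[name] = [
                decay * a + (1.0 - decay) * v
                for a, v in zip(averaged[name], values)
            ]
    return averaged


# Toy example with a single scalar parameter saved at three checkpoints.
toy_ckpts = [{"w": [0.0]}, {"w": [1.0]}, {"w": [1.0]}]
ema_weights = offline_ema(toy_ckpts, decay=0.7)
```

Because only the saved checkpoints are touched, no second copy of the weights needs to live in accelerator memory during training, which is the motivation given above.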

Figure 6 qualitatively demonstrates the capabilities of Muse for text prompts with different properties. The top left of Figure 6 shows examples that demonstrate a basic understanding of cardinality. For objects with non-unity cardinality, instead of generating the same object pixels multiple times, Muse adds contextual variations to make the overall image more realistic, e.g., elephant size and orientation, wine bottle wrapper color, and tennis ball rotation. The top right of Figure 6 demonstrates understanding of multi-object composition and relativeness. Instead of placing objects at random locations, Muse generates images that preserve prepositional object relations in the text, e.g., on vs under, left vs right, etc. The middle left of Figure 6 demonstrates its ability to generate images spanning many styles, both specific to a renowned artist (e.g., Rembrandt) as well as general to a style as a whole (e.g., pop art and Chinese ink and wash). The middle right of Figure 6 demonstrates the ability of Muse to render words and phrases. Text generation is fundamentally different from generating most other objects. Instead of the model learning a mapping between an object name and its characteristics (e.g., that “elephant” maps to “large”, “gray”, and “peanut eating”), the virtual continuum of possible words and phrases demands that the model learn differently. It must instead learn a hierarchical understanding between phrases, words, and letters. The bottom left of Figure 6 demonstrates that Muse uses the entirety of a text prompt when rendering instead of focusing exclusively on only a few salient words. Finally, Figure 7 shows comparisons between Muse, Dall-E 2 (Ramesh et al., 2022), and Imagen (Saharia et al., 2022) for some select prompts, showing that Muse is on par with Imagen and qualitatively better than Dall-E 2 for many prompts.

However, as demonstrated in the bottom right of Figure 6, Muse is limited in its ability to generate images well aligned with certain types of prompts. For prompts which indicate that long, multi-word phrases should be directly rendered, Muse has a tendency to render those phrases incorrectly, often resulting in (unwanted) duplicated rendered words or rendering of only a portion of the phrase. Additionally, prompts indicating high object cardinality tend to result in generated images which do not correctly reflect that desired cardinality (e.g., rendering only 7 wine bottles when the prompt specified 10). In general, the ability of Muse to render the correct cardinalities of objects decreases as the cardinality increases. Another difficult prompt type for Muse is one with multiple cardinalities (e.g., “four cats and a team of three dogs”). For such cases, Muse has a tendency to get at least one cardinality incorrect in its rendering.

3.2 Quantitative Performance

In Table 1 and Table 2, we show our performance against other methods on the CC3M (Sharma et al., 2018) and COCO (Lin et al., 2014) datasets as measured by Fréchet Inception Distance (FID) (Heusel et al., 2017), which measures quality and diversity of samples, as well as CLIP (Radford et al., 2021) score, which measures image/text alignment. For the CC3M results, both Muse models were trained on CC3M. The COCO results are zero-shot, using a model trained on the same dataset as Imagen (Saharia et al., 2022).

Our 632M model achieves SOTA results on CC3M, significantly improving upon the state of the art in FID score, and also achieving state of the art CLIP score. Our 3B model achieves an FID score of 7.88, which is slightly better than the score of 8.1 achieved by the Parti-3B model, which has a similar number of parameters. Our CLIP score of 0.32 is higher than the CLIP score of 0.29 achieved by Imagen (which is achieved when the FID is significantly higher, around 20). For the FID of 7.27, Imagen achieves a CLIP score of around 0.27 (see Figure 4 in (Saharia et al., 2022)).

Our sampling algorithm (Section 2.8) has a number of hyperparameters, such as guidance scale, sampling temperature, whether or not to linearly increase guidance during sampling, etc. We perform evaluation sweeps over these parameters. We find subsets of sampling parameters that are Pareto efficient, in the sense that we cannot improve FID without hurting CLIP. This allows us to study the tradeoff between diversity and image/text alignment, which we show in Figure 9.
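Selecting the Pareto-efficient settings from such a sweep can be sketched as follows. The (FID, CLIP) pairs are made-up placeholder values used only to exercise the code, not results from the paper, and the function name is hypothetical.

```python
def pareto_front(results):
    """Return the (fid, clip) pairs not dominated by any other pair.

    Lower FID is better and higher CLIP is better. A pair is dominated if
    another pair is at least as good on both metrics and strictly better on one.
    """
    front = []
    for i, (fid_i, clip_i) in enumerate(results):
        dominated = any(
            fid_j <= fid_i and clip_j >= clip_i
            and (fid_j < fid_i or clip_j > clip_i)
            for j, (fid_j, clip_j) in enumerate(results)
            if j != i
        )
        if not dominated:
            front.append((fid_i, clip_i))
    return sorted(front)


# Placeholder sweep results: one (FID, CLIP) pair per sampling configuration.
sweep = [(8.0, 0.30), (9.0, 0.32), (10.0, 0.31), (8.5, 0.29)]
front = pareto_front(sweep)
```

Moving along the resulting front trades FID against CLIP score, which is the diversity versus alignment tradeoff discussed above.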

Similar to previous works (Yu et al., 2022; Saharia et al., 2022), we perform side-by-side evaluations in which human raters are presented with a text prompt and two images, each generated by a different text-to-image model using that prompt. The raters are asked to assess prompt-image alignment via the question, “Which image matches with the caption better?” Each image pair is anonymized and randomly ordered (left vs right). Raters have the option of choosing either image or that they are indifferent. (Choosing indifference makes sense when neither image is aligned with the text prompt and helps reduce statistical noise in the results.) Each (prompt, image pair) triplet is assessed by five independent raters; the raters were provided through the Google internal crowd computing team and were completely anonymous to the Muse team. For the set of prompts presented to raters, we used PartiPrompts (Yu et al., 2022), a collection of 1650 text prompts curated to measure model capabilities across a variety of categories. For the two text-to-image models, we compared Muse (3B parameters) to Stable Diffusion v1.4 (Rombach et al., 2022), the text-to-image model most comparable to Muse in terms of inference speed. For each prompt, 16 image instances were generated, and the one with the highest CLIP score (Radford et al., 2021) was used. The Stable Diffusion images were generated via the CompVis Stable Diffusion v1.4 notebook (CompVis, 2022). We required at least a 3 rater consensus for results to be counted in favor of a particular model. From this analysis, we found that Muse was chosen as better aligned than Stable Diffusion for 70.6% of the prompts, Stable Diffusion was chosen as better aligned than Muse for 25.4%, and no rater consensus was reached for 4%. These results are consistent with Muse having significantly better caption matching capability ($\sim$2.7x). Figure 9 shows a breakdown of the rater results for rater consensuses of 3, 4, and all 5 possible votes. Prompts for which all 5 raters said Muse had better alignment than Stable Diffusion are the largest contributor.
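The consensus rule used in this evaluation can be sketched as a small tally function. The vote labels and the function name are hypothetical; only the rule itself (five raters, at least three agreeing votes for a model) comes from the text.

```python
from collections import Counter


def consensus(votes, min_votes=3):
    """Return the model favored by at least `min_votes` raters, else None.

    votes: one label per rater, each of "muse", "sd" or "indifferent".
    """
    counts = Counter(votes)
    for model in ("muse", "sd"):
        if counts[model] >= min_votes:
            return model
    return None  # no rater consensus for either model


# Example prompts: a 3-vote consensus for Muse, and a split with no consensus.
win = consensus(["muse", "muse", "muse", "sd", "indifferent"])
split = consensus(["muse", "muse", "sd", "sd", "indifferent"])

# Ratio of the reported preference percentages.
preference_ratio = 70.6 / 25.4
```

The ratio of the two reported percentages is about 2.78, in line with the roughly 2.7x figure quoted above.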

In addition to measuring alignment, other works (Yu et al., 2022; Saharia et al., 2022) have also measured image realism, often via a rater question similar to, “Which image is more realistic?”. However, we note that care must be taken with examination of such results. Though it is not the intent of the question, a model that is completely mode collapsed so that it generates the same sufficiently realistic image regardless of prompt will virtually always do better on this question than a model that does take the prompt into account during image generation. We propose this type of question is only applicable between models of similar alignment. Since Muse is significantly better aligned than Stable Diffusion, we did not assess realism via human raters. We consider this topic an area of open research.

3.2.2 Inference Speed

In Table 3, we compare the inference time of Muse to several other popular models. We benchmarked Parti-3B, Imagen, and Muse-3B internally on TPUv4 accelerators. For Stable Diffusion/LDM, we used the fastest reported benchmark (Lambda Labs, 2022), which was done on A100 GPUs. For Stable Diffusion, the TPU implementations we tested were not faster than the A100 implementation. We also report an inference time for LDM with 250 iterations, which is the configuration used to achieve the FID in Table 2. Muse is significantly faster than competing diffusion or autoregressive models, despite having comparable parameter counts (and around 3x more parameters than Stable Diffusion/LDM). The speed advantage of Muse over Imagen is due to the use of discrete tokens and requiring fewer sampling iterations. The speed advantage of Muse over Parti is due to the use of parallel decoding. The speed advantage of Muse over Stable Diffusion is primarily attributable to requiring fewer sampling iterations.

3.3 Image Editing

By exploiting the fact that our model can condition on arbitrary subsets of image tokens, we can use the model out-of-the-box for a variety of image editing applications with no additional training or model fine-tuning.

Our sampling procedure (Section 2.8) gives us text-guided inpainting and outpainting for free: we convert an input image into a set of tokens, mask out the tokens corresponding to a local region, and then sample the masked tokens conditioned on unmasked tokens and a text prompt. We integrate superresolution through a multi-scale approach: Given an image of size 512x512, we first decimate it to 256x256 and convert both images to high- and low-res tokens. Then, we mask out the appropriate regions for each set of tokens. Next, we inpaint the low-res tokens using the parallel sampling algorithm. Finally, we condition on these low-res tokens to inpaint the high-res tokens using the same sampling algorithm. We show examples of this in Figure 2 and Figure 10.
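The step of masking "the appropriate regions for each set of tokens" can be sketched by mapping a pixel box to token-grid masks at both scales. The overlap rule (a token is masked if its pixel footprint touches the box), the example box, and the function name are all assumptions made for illustration.

```python
def region_to_token_mask(box, image_size, f):
    """Boolean token-grid mask for a pixel box on a square image.

    box = (top, left, bottom, right) in pixels, bottom/right exclusive.
    A token is masked when its f x f pixel footprint overlaps the box.
    """
    top, left, bottom, right = box
    grid = image_size // f
    return [
        [
            r * f < bottom and (r + 1) * f > top
            and c * f < right and (c + 1) * f > left
            for c in range(grid)
        ]
        for r in range(grid)
    ]


# Region to inpaint, given in 512 x 512 pixel coordinates (illustrative values).
box_512 = (128, 128, 256, 256)
high_mask = region_to_token_mask(box_512, 512, 8)    # 64 x 64 token grid

# The same region after decimating the image to 256 x 256.
box_256 = tuple(v // 2 for v in box_512)
low_mask = region_to_token_mask(box_256, 256, 16)    # 16 x 16 token grid

num_high_masked = sum(map(sum, high_mask))
num_low_masked = sum(map(sum, low_mask))
```

The low-res mask would be filled in first by the base model, and the high-res mask would then be filled in conditioned on those low-res tokens, following the multi-scale procedure described above.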

3.3.2 Zero-shot Mask-free editing

We use Muse in a zero-shot sense for mask-free image editing of real input images. This method works directly on the (tokenized) image and does not require “inverting” the full generative process, in contrast with recent zero-shot image editing techniques leveraging generative models (Gal et al., 2022b; Patashnik et al., 2021; Kim et al., 2022; Mokady et al., 2022).

We first convert an input image into visual tokens. Next, we iteratively mask and resample a random subset of tokens, conditioned on text prompts. We can think of this as being analogous to a Gibbs sampling procedure, where we fix some tokens and resample others conditioned on them. This has the effect of moving the tokenized image into the typical set of the conditional distribution of images given a text prompt.

We perform the editing using the low-resolution base model, then perform superres on the final output (conditioned on the editing prompt). In the examples (Figure 2, Figure 11), we resample 8% of the tokens per iteration for 100 iterations, with a guidance scale of 4. We also perform top-$k$ ($k=3$) sampling on the token logits to prevent the process from diverging too much from the input. The iterative nature allows for control over the final output. Figure 12 shows a few intermediate edits (without superres); in this example, the user may prefer iteration 50 or 75 over the final output.
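The resampling loop can be sketched as below. The model call is replaced by a hypothetical `predict_logits(tokens, masked_positions)` callback, here a toy function that always favors one code, and classifier-free guidance is omitted for brevity; the function names and the toy codebook size of 16 are assumptions. Only the 8% fraction, the 100 iterations and $k=3$ come from the text.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample an index from the k largest logits, with softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def edit_tokens(tokens, predict_logits, iterations=100, fraction=0.08, k=3, seed=0):
    """Iteratively mask a random subset of tokens and resample them."""
    rng = random.Random(seed)
    tokens = list(tokens)
    num_resampled = max(1, round(fraction * len(tokens)))
    for _ in range(iterations):
        positions = rng.sample(range(len(tokens)), num_resampled)
        # Hypothetical model call: logits for each masked position,
        # conditioned on the remaining tokens (and, in Muse, the text prompt).
        logits = predict_logits(tokens, set(positions))
        for p in positions:
            tokens[p] = top_k_sample(logits[p], k, rng)
    return tokens


def toy_logits(tokens, masked_positions):
    """Stand-in predictor over a 16-entry codebook that favors code 7."""
    return {p: [5.0 if v == 7 else 0.0 for v in range(16)] for p in masked_positions}


edited = edit_tokens([0] * 256, toy_logits)
```

For a $16\times 16$ token map, 8% corresponds to about 20 of the 256 tokens per iteration; keeping intermediate `tokens` snapshots would give the kind of intermediate edits mentioned above.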

Related Work

Variational autoencoders (Van Den Oord et al., 2017) and Generative Adversarial Models (GANs) have shown excellent image generation performance with many variants proposed for both convolutional and Transformer architectures e.g. (Goodfellow et al., 2020; Esser et al., 2021b; Karras et al., 2019; Brock et al., 2018; Donahue & Simonyan, 2019). Until recently, GANs were considered state of the art. Diffusion models, based on progressive denoising principles, are now able to synthesize images and video at equal or higher fidelity (Ho et al., 2020; Kingma et al., 2021; Ho et al., 2022). Hybrid approaches that combine principles from multiple approaches have also shown excellent performance (Chang et al., 2022; Lezama et al., 2022), suggesting that there are more complementarities between approaches that can be exploited.

4.2 Image Tokenizers

Image tokenizers are proving to be useful for multiple generative models due to the ability to move the bulk of the computation from input (pixel) space to latents (Rombach et al., 2022), or to enable more effective loss functions such as classification instead of regression (Chang et al., 2022; Lezama et al., 2022; Li et al., 2022). A number of tokenization approaches such as Discrete VAEs (Rolfe, 2016), VQVAE (Van Den Oord et al., 2017) and VQGAN (Esser et al., 2021b) have been developed, with the latter being the highest-performing as it combines perceptual and adversarial losses to achieve excellent reconstruction. ViT-VQGAN (Yu et al., 2021) extends VQGAN to the Transformer architecture. We use VQGAN rather than ViT-VQGAN as we found it to perform better for our model, noting that a better performing tokenization model does not always translate to a better performing text-to-image model.

4.3 Large Language Models

Our work leverages T5, a pre-trained large language model (LLM) that has been trained on multiple text-to-text tasks (Raffel et al., 2020). LLMs (including T5, BERT (Devlin et al., 2018), and GPT (Brown et al., 2020; Radford et al., 2019)) have been shown to learn powerful embeddings which enable few-shot transfer learning. We leverage this capacity in our model. All of the modern LLMs are trained on token prediction tasks (either autoregressive or not). The insights regarding the power of token prediction are leveraged in this work, where we apply a transformer to predict visual tokens.

4.4 Text-Image Models

Leveraging paired text-image data is proving to be a powerful learning paradigm for representation learning and generative models. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) train models to align pairs of text and image embeddings, showing excellent transfer and few-shot capabilities. Imagen (Saharia et al., 2022) and Parti (Yu et al., 2022) use similar large scale text-image datasets (Schuhmann et al., 2021, 2022) to learn how to predict images from text inputs, achieving excellent results on FID and human evaluations. A key trick is the use of classifier-free guidance (Ho & Salimans, 2022; Dhariwal & Nichol, 2021) that trades off diversity and quality.

4.5 Image Editing with Generative Models

GANs have been extensively studied for image editing and manipulation capabilities (see (Xia et al., 2022) for a survey). A number of techniques have been developed on diffusion models to enable editing, personalization and inversion to token space (Gal et al., 2022a; Meng et al., 2021; Ruiz et al., 2022; Kawar et al., 2022; Brooks et al., 2022; Hertz et al., 2022; Mokady et al., 2022). Dreambooth (Ruiz et al., 2022) and Imagic (Kawar et al., 2022) involve fine-tuning of the generative models. ImagenEditor (Wang et al., 2022) frames the editing task as text-guided image inpainting, and involves user specified masks.

Discussion and Social Impact

The Muse model confirms the findings of (Saharia et al., 2022) that frozen large pretrained language models serve as powerful text encoders for text-to-image generation. We also tried in our initial experiments to learn a language model from scratch on the training data, but found that performance was significantly worse than using a pre-trained LLM, especially on long prompts and rare words. We also show that non-diffusion, non-autoregressive models based on the Transformer architecture can perform on par with diffusion models while being significantly more efficient at inference time. We achieve SOTA CLIP scores, showing an excellent alignment between image and text. We also show the flexibility of our approach with a number of image editing applications.

We recognize that generative models have a number of applications with varied potential for impact on human society. Generative models (Saharia et al., 2022; Yu et al., 2022; Rombach et al., 2022; Midjourney, 2022) hold significant potential to augment human creativity (Hughes et al., 2021). However, it is well known that they can also be leveraged for misinformation, harassment and various types of social and cultural biases (Franks & Waldman, 2018; Whittaker et al., 2020; Srinivasan & Uchino, 2021; Steed & Caliskan, 2021). Due to these important considerations, we opt to not release code or a public demo at this point in time.

Dataset biases are another important ethical consideration due to the requirement of large datasets that are mostly automatically curated. Such datasets have various potentially problematic issues such as consent and subject awareness (Paullada et al., 2021; Dulhanty, 2020; Scheuerman et al., 2021). Many of the commonly used datasets tend to reflect negative social stereotypes and viewpoints (Prabhu & Birhane, 2020). Thus, it is quite feasible that training on such datasets simply amplifies these biases and significant additional research is required on how to mitigate such biases, and generate datasets that are free of them: this is a very important topic (Buolamwini & Gebru, 2018; Hendricks et al., 2018) that is out of the scope of this paper.

Given the above considerations, we do not recommend the use of text-to-image generation models without attention to the various use cases and an understanding of the potential for harm. We especially caution against using such models for generation of people, humans and faces.

Acknowledgements

We thank William Chan, Chitwan Saharia, and Mohammad Norouzi for providing us training datasets, various evaluation codes and generous suggestions. Jay Yagnik, Rahul Sukthankar, Tom Duerig and David Salesin provided enthusiastic support of this project for which we are grateful. We thank Victor Gomes and Erica Moreira for infrastructure support, Jing Yu Koh and Jason Baldridge for dataset, model and evaluation discussions and feedback on the paper, Mike Krainin for model speedup discussions, JD Velasquez for discussions and insights, Sarah Laszlo, Kathy Meier-Hellstern, and Rachel Stigler for assisting us with the publication process, Andrew Bunner, Jordi Pont-Tuset, and Shai Noy for help on internal demos, David Fleet, Saurabh Saxena, Jiahui Yu, and Jason Baldridge for sharing Imagen and Parti speed metrics.

References

Appendix A Appendix.

Our base model configuration for our largest model of size 3B parameters is given in Table 4.

A.2 VQGAN Configurations

VQGAN Architecture: Our VQGAN architecture is similar to the previous work (Esser et al., 2021b). It consists of several residual blocks, downsample (encoder) and upsample (decoder) blocks. The main difference is that we remove the non-local block to make the encoder and decoder fully convolutional to support different image sizes. In the base VQGAN model, we apply 2 residual blocks in each resolution and the base channel dimension is 128. For the finetuned decoder, we apply 4 residual blocks in each resolution and set the base channel dimension to 256.

A.3 Super Resolution Configurations