Vector Quantized Diffusion Model for Text-to-Image Synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo

Introduction

The recent success of the Transformer in natural language processing (NLP) has raised tremendous interest in using successful language models for computer vision tasks. The autoregressive (AR) model is one of the most natural and popular approaches for transferring from text-to-text generation (e.g., machine translation) to text-to-image generation. Based on the AR model, the recent work DALL-E has achieved impressive results for text-to-image generation.

Despite their success, existing text-to-image generation methods still have weaknesses that need to be improved. One issue is the unidirectional bias. Existing methods predict pixels or tokens in the reading order, from top-left to bottom-right, based on the attention to all prefix pixels/tokens and the text description. This fixed order introduces unnatural bias in the synthesized images because important contextual information may come from any part of the image, not just from left or above. Another issue is the accumulated prediction errors. Each step of the inference stage is performed based on previously sampled tokens – this is different from that of the training stage, which relies on the so-called “teacher-forcing” practice and provides the ground truth for each step. This difference is important and its consequence merits careful examination. In particular, a token in the inference stage, once predicted, cannot be corrected and its errors will propagate to the subsequent tokens.

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation, a model that eliminates the unidirectional bias and avoids accumulated prediction errors. We start with a vector quantized variational autoencoder (VQ-VAE) and model its latent space by learning a parametric model using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM), which has been applied to image synthesis with compelling results. We show that the latent-space model is well-suited for the task of text-to-image generation. Roughly speaking, the VQ-Diffusion model samples the data distribution by reversing a forward diffusion process that gradually corrupts the input via a fixed Markov chain. The forward process yields a sequence of increasingly noisy latent variables of the same dimensionality as the input, producing pure noise after a fixed number of timesteps. Starting from this noise, the reverse process gradually denoises the latent variables towards the desired data distribution by learning the conditional transition distribution.

The VQ-Diffusion model eliminates the unidirectional bias. It consists of an independent text encoder and a diffusion image decoder, which performs denoising diffusion on discrete image tokens. At the beginning of the inference stage, all image tokens are either masked or random. Here the masked token serves the same function as those in mask-based generative models. The denoising diffusion process gradually estimates the probability density of image tokens step by step based on the input text. In each step, the diffusion image decoder leverages the contextual information of all tokens of the entire image predicted in the previous step to estimate a new probability density distribution and uses this distribution to predict the tokens in the current step. This bidirectional attention provides global context for each token prediction and eliminates the unidirectional bias.

The VQ-Diffusion model, with its mask-and-replace diffusion strategy, also avoids the accumulation of errors. In the training stage, we do not use the “teacher-forcing” strategy. Instead, we deliberately introduce both masked tokens and random tokens and let the network learn to predict the masked tokens and modify incorrect tokens. In the inference stage, we update the density distribution of all tokens in each step and resample all tokens according to the new distribution. Thus we can modify the wrong tokens and prevent error accumulation. Compared with the conventional replace-only diffusion strategy for unconditional image generation, the masked tokens effectively direct the network’s attention to the masked areas and thus greatly reduce the number of token combinations to be examined by the network. This mask-and-replace diffusion strategy significantly accelerates the convergence of the network.

To assess the performance of the VQ-Diffusion method, we conduct text-to-image generation experiments on a wide variety of datasets, including CUB-200, Oxford-102, and MSCOCO. Compared with an AR model with a similar number of parameters, our method achieves significantly better results, as measured by both image quality metrics and visual examination, and is much faster. Compared with previous GAN-based text-to-image methods, our method can handle more complex scenes and the synthesized image quality is improved by a large margin. Compared with extremely large models (models with ten times more parameters than ours), including DALL-E and CogView, our model achieves comparable or better results for specific types of images, i.e., the types of images that our model has seen during the training stage. Furthermore, our method is general and produces strong results in our experiments on both unconditional and conditional image generation with the FFHQ and ImageNet datasets.

The VQ-Diffusion model also provides important benefits for the inference speed. With traditional AR methods, the inference time increases linearly with the output image resolution and the image generation is quite time consuming even for normal-size images (e.g., images larger than small thumbnail images of $64\times 64$ pixels). The VQ-Diffusion provides the global context for each token prediction and makes it independent of the image resolution. This allows us to provide an effective way to achieve a better tradeoff between the inference speed and the image quality by a simple reparameterization of the diffusion image decoder. Specifically, in each step, we ask the decoder to predict the original noise-free image instead of the noise-reduced image in the next denoising diffusion step. Through experiments we have found that the VQ-Diffusion method with reparameterization can be fifteen times faster than AR methods while achieving a better image quality.

Related Work

GAN-based Text-to-image generation. In the past few years, Generative Adversarial Networks (GANs) have shown promising results on many tasks, especially text-to-image generation. GAN-INT-CLS was the first to use a conditional GAN formulation for text-to-image generation. Based on this formulation, several approaches were proposed to further improve the generation quality. These models generate high-fidelity images on single-domain datasets, e.g., birds and flowers. However, due to the inductive bias toward locality in convolutional neural networks, they struggle on complex scenes with multiple objects, such as those in the MS-COCO dataset.

Other works adopt a two-step process that first infers the semantic layout and then generates the different objects, but this kind of method requires fine-grained object labels, e.g., object bounding boxes or segmentation maps.

Autoregressive Models. AR models have shown powerful density estimation capability and have recently been applied to image generation. PixelRNN, Image Transformer and ImageGPT factorize the probability density of an image over raw pixels. As a result, they only generate low-resolution images, such as $64\times 64$, because the amount of computation for large images is unaffordable.

VQ-VAE, VQGAN and ImageBART train an encoder to compress the image into a low-dimensional discrete latent space and fit the density of the hidden variables. This greatly improves the performance of image generation.

DALL-E, CogView and M6 propose AR-based text-to-image frameworks. They model the joint distribution of text and image tokens. With a powerful large transformer structure and massive text-image pairs, they greatly advance the quality of text-to-image generation, but they still suffer from unidirectional bias and accumulated prediction errors due to the limitations of AR models.

Denoising Diffusion Probabilistic Models. Diffusion generative models have recently achieved strong results on image generation and image super-resolution. However, most previous works only considered continuous diffusion models on raw image pixels. Discrete diffusion models have also been described, and were then applied to text generation in Argmax Flow. D3PMs applies discrete diffusion to image generation. However, it also estimates the density of raw image pixels and can only generate low-resolution (e.g., $32\times 32$) images.

Background: Learning Discrete Latent Space of Images Via VQ-VAE

Transformer architectures have shown great promise in image synthesis due to their outstanding expressivity. In this work, we aim to leverage the transformer to learn the mapping from text to image. Since the computation cost is quadratic in the sequence length, it is computationally prohibitive to directly model raw pixels using transformers. To address this issue, recent works propose to represent an image by discrete image tokens with a reduced sequence length. A transformer can then be trained effectively on this reduced context length and learn the translation from the text to the image tokens.
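For reference, a VQ-VAE of this kind is commonly trained end-to-end with a reconstruction term, a codebook term, and a commitment term. A sketch of such an objective follows; the symbols $E$ (encoder), $\tilde{\bm{x}}$ (reconstruction), $\bm{z}_{q}$ (quantized latent), and $\beta$ (commitment weight) are notation assumed here, and the exact norms may differ from the authors' Equation 2.

$$
\mathcal{L}_{\text{VQVAE}}=\|\bm{x}-\tilde{\bm{x}}\|_{1}+\|\text{sg}[E(\bm{x})]-\bm{z}_{q}\|_{2}^{2}+\beta\,\|\text{sg}[\bm{z}_{q}]-E(\bm{x})\|_{2}^{2}
$$

In this form, the second term is the one that moves codebook entries toward the encoder outputs.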

Here, $\text{sg}[\cdot]$ stands for the stop-gradient operation used in the VQ-VAE training loss (Equation 2). In practice, we replace the second term of Equation 2 with exponential moving averages (EMA) to update the codebook entries, which has been shown to work better than directly using the loss function.

Vector Quantized Diffusion Model

Previous autoregressive models, e.g., DALL-E and CogView, sequentially predict each image token conditioned on the text tokens as well as the previously predicted image tokens, i.e., $q(\bm{x}|\bm{y})=\prod_{i=1}^{N}q({x}^{i}|x^{1},\cdots,x^{i-1},\bm{y})$. While autoregressive modeling achieves remarkable quality in text-to-image synthesis, it has several limitations. First, image tokens are predicted in a unidirectional ordering, e.g., raster scan, which neglects the structure of 2D data and restricts the expressivity for image modeling, since the prediction at a specific location should not merely attend to the context on the left or above. Second, there is a train-test discrepancy, as training employs the ground truth whereas inference relies on previously predicted tokens. This discrepancy, which stems from the so-called “teacher-forcing” practice and is known as exposure bias, leads to error accumulation due to mistakes in earlier sampling. Moreover, predicting each token requires a forward pass of the network, which consumes an inordinate amount of time even when sampling in a low-resolution latent space (i.e., $32\times 32$), making the AR model impractical for real usage.

We aim to model the VQ-VAE latent space in a non-autoregressive manner. The proposed VQ-Diffusion method maximizes the probability $q(\bm{x}|\bm{y})$ with the diffusion model, an emerging approach that produces compelling quality on image synthesis. While the majority of recent works focus on continuous diffusion models, using them for categorical distributions is much less researched. In this work, we propose to use a conditional variant of the discrete diffusion process for text-to-image generation. We will subsequently introduce the discrete diffusion process inspired by masked language modeling (MLM), and then discuss how to train a neural network to reverse this process.

At a high level, the forward diffusion process gradually corrupts the image data $\bm{x}_{0}$ via a fixed Markov chain $q(\bm{x}_{t}|\bm{x}_{t-1})$, e.g., by randomly replacing some tokens of $\bm{x}_{t-1}$. After a fixed number of $T$ timesteps, the forward process yields a sequence of increasingly noisy latent variables $\bm{x}_{1},...,\bm{x}_{T}$ of the same dimensionality as $\bm{x}_{0}$, and $\bm{x}_{T}$ becomes pure noise tokens. Starting from the noise $\bm{x}_{T}$, the reverse process gradually denoises the latent variables and restores the real data $\bm{x}_{0}$ by sampling from the reverse distribution $q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_{0})$ sequentially. However, since $\bm{x}_{0}$ is unknown in the inference stage, we train a transformer network to approximate the conditional transition distribution $p_{\theta}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y})$, which depends on the entire data distribution.

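Consider a single image token that takes one of $K$ codebook indices, and let $[\bm{Q}_{t}]_{mn}=q(x_{t}=m|x_{t-1}=n)$ denote the probability that state $n$ transits to state $m$ at step $t$. The forward step can then be written as $q(x_{t}|x_{t-1})=\bm{v}^{\top}(x_{t})\bm{Q}_{t}\bm{v}(x_{t-1})$, where $\bm{v}(x)$ is a one-hot column vector of length $K$ whose only nonzero entry, at index $x$, is 1. The categorical distribution over $x_{t}$ is given by the vector $\bm{Q}_{t}\bm{v}(x_{t-1})$.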

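Importantly, due to the Markov property, one can marginalize out the intermediate steps and derive the probability of $x_{t}$ at an arbitrary timestep directly from $x_{0}$ as $q(x_{t}|x_{0})=\bm{v}^{\top}(x_{t})\overline{\bm{Q}}_{t}\bm{v}(x_{0})$, with $\overline{\bm{Q}}_{t}=\bm{Q}_{t}\cdots\bm{Q}_{1}$ (Equation 4).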

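Another notable characteristic is that, by conditioning on $x_{0}$, the posterior of this diffusion process is tractable, i.e., by Bayes' rule and the Markov property, $q(x_{t-1}|x_{t},x_{0})=q(x_{t}|x_{t-1})\,q(x_{t-1}|x_{0})/q(x_{t}|x_{0})$, where each factor is an entry of $\bm{Q}_{t}$ or of the cumulative matrices $\overline{\bm{Q}}_{t-1}$ and $\overline{\bm{Q}}_{t}$.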
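As a rough illustration of why this posterior is cheap to evaluate, the sketch below computes $q(x_{t-1}|x_{t},x_{0})$ for a single token from a one-step matrix and a cumulative matrix. It uses plain Python lists and a toy matrix that keeps a token with probability `alpha` and otherwise resamples it uniformly, with $K=3$. The helper names `uniform_matrix` and `posterior` and all numeric values are illustrative assumptions, not part of the method description.

```python
def uniform_matrix(alpha, K):
    """Toy K x K transition matrix, indexed [next_state][previous_state]."""
    beta = (1.0 - alpha) / K
    return [[beta + (alpha if m == n else 0.0) for n in range(K)] for m in range(K)]


def posterior(Q_t, Q_bar_prev, x_t, x_0):
    """Return q(x_{t-1} = k | x_t, x_0) for every state k.

    Q_t is the one-step matrix at step t and Q_bar_prev the cumulative
    matrix up to step t-1, both indexed [next_state][previous_state].
    """
    unnormalized = [Q_t[x_t][k] * Q_bar_prev[k][x_0] for k in range(len(Q_t))]
    evidence = sum(unnormalized)  # equals q(x_t | x_0); assumed to be nonzero
    return [u / evidence for u in unnormalized]


K = 3
Q_1 = uniform_matrix(0.8, K)  # illustrative step-1 matrix (also the cumulative matrix at t = 1)
Q_2 = uniform_matrix(0.5, K)  # illustrative step-2 matrix
probs = posterior(Q_2, Q_1, x_t=0, x_0=0)
print(probs)
```

The normalizer is simply the sum of the unnormalized entries, so no explicit product of $t$ matrices is needed once the cumulative matrix is available.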

The transition matrix $\bm{Q}_{t}$ is crucial to the discrete diffusion model and should be carefully designed such that it is not too difficult for the reverse network to recover the signal from noises.

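Previous works propose to introduce a small amount of uniform noise to the categorical distribution, and the transition matrix can be formulated as $\bm{Q}_{t}=\alpha_{t}\bm{I}+\beta_{t}\bm{1}\bm{1}^{\top}$, i.e., a $K\times K$ matrix with diagonal entries $\alpha_{t}+\beta_{t}$ and off-diagonal entries $\beta_{t}$,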

with $\alpha_{t}\in[0,1]$ and $\beta_{t}=(1-\alpha_{t})/K$. Each token stays at its previous value with total probability $(\alpha_{t}+\beta_{t})$: with probability $K\beta_{t}=1-\alpha_{t}$ it is resampled uniformly over all $K$ categories, and such a resample returns the previous value with probability $1/K$.
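The uniform corruption step can be sketched for a list of token indices as follows. This is a minimal standard-library illustration, not the authors' implementation; the function name `corrupt_uniform`, the seed, and the toy values of `alpha_t` and `K` are assumptions chosen for the example.

```python
import random


def corrupt_uniform(tokens, alpha_t, K, rng):
    """Apply one uniform-noise forward step to a list of token indices in [0, K)."""
    out = []
    for x in tokens:
        if rng.random() < alpha_t:
            out.append(x)                 # keep with probability alpha_t
        else:
            out.append(rng.randrange(K))  # resample uniformly with probability K * beta_t
    return out


rng = random.Random(0)
tokens = [0] * 20000
noisy = corrupt_uniform(tokens, alpha_t=0.6, K=4, rng=rng)
stay_rate = sum(a == b for a, b in zip(tokens, noisy)) / len(tokens)
print(stay_rate)  # should be close to alpha_t + beta_t = 0.7
```

The measured stay rate exceeds `alpha_t` because a uniform resample can return the original index.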

Nonetheless, data corruption using uniform diffusion is a somewhat aggressive process that may pose a challenge for the reverse estimation. First, as opposed to the Gaussian diffusion process for ordinal data, an image token may be replaced by an utterly uncorrelated category, which leads to an abrupt semantic change for that token. Second, the network has to make an extra effort to figure out which tokens have been replaced before fixing them. In fact, due to the semantic conflict within the local context, the reverse estimation for different image tokens may form a competition and run into the dilemma of identifying the reliable tokens.
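One way to make corrupted positions explicit, in the spirit of masked language modeling, is to add a special [MASK] state so that each token has $K+1$ possible states. A sketch of the resulting transition matrix is given below, under the assumption that an ordinary token is masked with probability $\gamma_{t}$, resampled uniformly with probability $K\beta_{t}$, and kept with probability $\alpha_{t}=1-K\beta_{t}-\gamma_{t}$, while [MASK] always stays [MASK]. The last row and column correspond to the [MASK] state.

$$
\bm{Q}_{t}=\begin{bmatrix}
\alpha_{t}+\beta_{t} & \beta_{t} & \cdots & \beta_{t} & 0\\
\beta_{t} & \alpha_{t}+\beta_{t} & \cdots & \beta_{t} & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\beta_{t} & \beta_{t} & \cdots & \alpha_{t}+\beta_{t} & 0\\
\gamma_{t} & \gamma_{t} & \cdots & \gamma_{t} & 1
\end{bmatrix}
$$

Each column sums to one, since $\alpha_{t}+K\beta_{t}+\gamma_{t}=1$.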

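The mask-and-replace transition, which combines a special $[\text{\tt{MASK}}]$ state with a small amount of uniform token replacement, has several benefits: 1) the corrupted tokens are distinguishable to the network, which eases the reverse process. 2) Compared with a mask-only approach, we theoretically prove that it is necessary to include a small amount of uniform noise besides the token masking; otherwise we get a trivial posterior when $x_{t}\neq x_{0}$. 3) The random token replacement forces the network to understand the context rather than only focusing on the $[\text{\tt{MASK}}]$ tokens. 4) The cumulative transition matrix $\overline{\bm{Q}}_{t}$ and the probability $q(x_{t}|x_{0})$ in Equation 4 can be computed in closed form with $\overline{\bm{Q}}_{t}\bm{v}(x_{0})=\overline{\alpha}_{t}\bm{v}(x_{0})+(\overline{\gamma}_{t}-\overline{\beta}_{t})\bm{v}(K+1)+\overline{\beta}_{t}$, where $\bm{v}(K+1)$ is the one-hot vector of the $[\text{\tt{MASK}}]$ state and the last term is added to every entry.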

Here $\gamma_{i}$ denotes the per-step probability of replacing a token with $[\text{\tt{MASK}}]$, and $\overline{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$, $\overline{\gamma}_{t}=1-\prod_{i=1}^{t}(1-\gamma_{i})$, and $\overline{\beta}_{t}=(1-\overline{\alpha}_{t}-\overline{\gamma}_{t})/K$ can be calculated and stored in advance. Thus, the computation cost of $q(x_{t}|x_{0})$ is reduced from $O(tK^{2})$ to $O(K)$. The proof is given in the supplemental material.
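The closed form can be checked numerically on a toy problem. The sketch below builds per-step mask-and-replace matrices, multiplies them explicitly, and compares the product with the matrix built from the cumulative coefficients. It assumes the per-step rule $\alpha_{t}=1-K\beta_{t}-\gamma_{t}$ with an absorbing [MASK] state stored at index $K$; the helper names and the $(\gamma_{t},\beta_{t})$ values are made up for illustration.

```python
def mask_and_replace_matrix(alpha, beta, gamma, K):
    """(K+1) x (K+1) matrix indexed [next_state][previous_state]; index K is [MASK]."""
    Q = [[0.0] * (K + 1) for _ in range(K + 1)]
    for n in range(K):
        for m in range(K):
            Q[m][n] = beta + (alpha if m == n else 0.0)
        Q[K][n] = gamma          # an ordinary token becomes [MASK]
    Q[K][K] = 1.0                # [MASK] never leaves its state
    return Q


def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


K = 3
steps = [(0.10, 0.02), (0.20, 0.03), (0.30, 0.01)]  # (gamma_t, beta_t), illustrative values

Q_bar = None
alpha_bar, unmasked = 1.0, 1.0
for gamma, beta in steps:
    alpha = 1.0 - K * beta - gamma
    Q_t = mask_and_replace_matrix(alpha, beta, gamma, K)
    Q_bar = Q_t if Q_bar is None else matmul(Q_t, Q_bar)  # Q_bar_t = Q_t ... Q_1
    alpha_bar *= alpha
    unmasked *= 1.0 - gamma

gamma_bar = 1.0 - unmasked
beta_bar = (1.0 - alpha_bar - gamma_bar) / K
Q_closed = mask_and_replace_matrix(alpha_bar, beta_bar, gamma_bar, K)
max_gap = max(abs(Q_bar[i][j] - Q_closed[i][j])
              for i in range(K + 1) for j in range(K + 1))
print(max_gap)
```

Only three scalars per timestep need to be stored, which is what makes evaluating $q(x_{t}|x_{0})$ linear in $K$.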

Learning the reverse process

To reverse the diffusion process, we train a denoising network $p_{\theta}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y})$ to estimate the posterior transition distribution $q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_{0})$. The network is trained to minimize the variational lower bound (VLB) $\mathcal{L}_{vlb}$.
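For orientation, the variational bound of a diffusion model is usually written as a sum of per-timestep terms. The decomposition below is the standard form expressed in this section's notation; it is a sketch consistent with the surrounding definitions rather than a verbatim copy of the authors' equation.

$$
\begin{aligned}
\mathcal{L}_{vlb}&=\mathcal{L}_{0}+\mathcal{L}_{1}+\cdots+\mathcal{L}_{T-1}+\mathcal{L}_{T},\\
\mathcal{L}_{0}&=-\log p_{\theta}(\bm{x}_{0}|\bm{x}_{1},\bm{y}),\\
\mathcal{L}_{t-1}&=D_{KL}\big(q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_{0})\,\|\,p_{\theta}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y})\big),\\
\mathcal{L}_{T}&=D_{KL}\big(q(\bm{x}_{T}|\bm{x}_{0})\,\|\,p(\bm{x}_{T})\big).
\end{aligned}
$$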

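Here, $p(\bm{x}_{T})$ is the prior distribution at timestep $T$. For the proposed mask-and-replace diffusion, the prior is $p(\bm{x}_{T})=[\overline{\beta}_{T},\overline{\beta}_{T},\cdots,\overline{\beta}_{T},\overline{\gamma}_{T}]^{\top}$.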

Note that since the transition matrix $\bm{Q}_{t}$ is fixed during training, $\mathcal{L}_{T}$ is a constant that measures the gap between training and inference and can be ignored in the training.

Based on the reparameterization trick, we can introduce an auxiliary denoising objective, which encourages the network to predict the noiseless token $x_{0}$ directly.
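A sketch of how this reparameterization can be written, assuming the network outputs a distribution $p_{\theta}(\tilde{\bm{x}}_{0}|\bm{x}_{t},\bm{y})$ over noiseless tokens (the symbols $\tilde{\bm{x}}_{0}$ and $\mathcal{L}_{x_{0}}$ are notation introduced here): the reverse transition is obtained by averaging the tractable posterior over the predicted $\tilde{\bm{x}}_{0}$, and the auxiliary objective is the negative log-likelihood of the true $\bm{x}_{0}$.

$$
\begin{aligned}
p_{\theta}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y})&=\sum_{\tilde{\bm{x}}_{0}=1}^{K}q(\bm{x}_{t-1}|\bm{x}_{t},\tilde{\bm{x}}_{0})\,p_{\theta}(\tilde{\bm{x}}_{0}|\bm{x}_{t},\bm{y}),\\
\mathcal{L}_{x_{0}}&=-\log p_{\theta}(\bm{x}_{0}|\bm{x}_{t},\bm{y}).
\end{aligned}
$$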

We find that combining this loss with $\mathcal{L}_{vlb}$ improves the image quality.

Fast inference strategy. In the inference stage, by leveraging the reparameterization trick, we can skip some steps of the diffusion model to achieve faster inference.

Specifically, assuming the time stride is $\Delta_{t}$, instead of sampling images in the chain $x_{T},x_{T-1},x_{T-2},...,x_{0}$, we sample images in the chain $x_{T},x_{T-\Delta_{t}},x_{T-2\Delta_{t}},...,x_{0}$, using a reverse transition distribution that spans $\Delta_{t}$ steps at a time.
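Under the same reparameterization, where the network predicts a distribution over the noiseless token $\tilde{\bm{x}}_{0}$ (notation assumed here), a strided reverse transition consistent with this description can be sketched as:

$$
p_{\theta}(\bm{x}_{t-\Delta_{t}}|\bm{x}_{t},\bm{y})=\sum_{\tilde{\bm{x}}_{0}=1}^{K}q(\bm{x}_{t-\Delta_{t}}|\bm{x}_{t},\tilde{\bm{x}}_{0})\,p_{\theta}(\tilde{\bm{x}}_{0}|\bm{x}_{t},\bm{y}).
$$

Setting $\Delta_{t}=1$ recovers the ordinary one-step reverse transition.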

We found that this makes sampling more efficient while causing only a small loss in quality. The complete training and inference procedures are shown in Algorithms 1 and 2.
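The strided chain can be illustrated with a small helper that lists the timesteps visited during inference. This is a sketch only: the function name `strided_timesteps` and the handling of a stride that does not divide $T$ (the last jump is simply shortened) are assumptions made for the example.

```python
def strided_timesteps(T, stride):
    """Timesteps visited by the reverse process: T, T - stride, ..., ending at 0."""
    steps = list(range(T, 0, -stride))
    steps.append(0)  # always finish at the clean sample x_0
    return steps


schedule = strided_timesteps(T=100, stride=4)
num_network_calls = len(schedule) - 1  # one denoiser call per transition
print(schedule[:4], num_network_calls)
```

With $T=100$ and a stride of 4, the denoiser is called 25 times instead of 100.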

Experiments

In this section, we first introduce the overall experiment setups and then present extensive results to demonstrate the superiority of our approach in text-to-image synthesis. Finally, we point out that our method is a general image synthesis framework that achieves great performance on other generation tasks, including unconditional and class conditional image synthesis.

Datasets. To demonstrate the capability of our proposed method for text-to-image synthesis, we conduct experiments on the CUB-200, Oxford-102, and MSCOCO datasets. The CUB-200 dataset contains 8855 training images and 2933 test images belonging to 200 bird species. The Oxford-102 dataset contains 8189 images of flowers from $102$ categories. Each image in the CUB-200 and Oxford-102 datasets is paired with 10 text descriptions. The MSCOCO dataset contains $82k$ images for training and $40k$ images for testing. Each image in this dataset has five text descriptions.

To further demonstrate the scalability of our method, we also train our model on large-scale datasets, including Conceptual Captions and LAION-400M. The Conceptual Captions dataset, including both the CC3M and CC12M datasets, contains 15M images. To balance the text and image distribution, we filter a 7M subset according to the word frequency. The LAION-400M dataset contains 400M image-text pairs. We train our model on three subsets from LAION, i.e., cartoon, icon, and human, which contain 0.9M, 1.3M, and 42M images, respectively. For each subset, we filter the data according to the text.

Training Details. Our VQ-VAE’s encoder and decoder follow the setting of VQGAN, which leverages the GAN loss to get more realistic images. We directly adopt the publicly available VQGAN model trained on the OpenImages dataset for all text-to-image synthesis experiments. It converts $256\times 256$ images into $32\times 32$ tokens. The codebook size is $K=2886$ after removing useless codes. We adopt the publicly available tokenizer of the CLIP model as the text encoder, yielding a conditional sequence of length 77. We fix both the image and text encoders in our training.

For a fair comparison with previous text-to-image methods under similar parameter counts, we build two different diffusion image decoder settings: 1) VQ-Diffusion-S (Small), which contains $18$ transformer blocks with a dimension of $192$ and has $34M$ parameters. 2) VQ-Diffusion-B (Base), which contains $19$ transformer blocks with a dimension of $1024$ and has $370M$ parameters.

To show the scalability of our method, we also train our base model on a larger dataset, Conceptual Captions, and then fine-tune it on each dataset. This model is denoted as VQ-Diffusion-F.

For the default setting, we set timesteps $T=100$ and loss weight $\lambda=0.0005$. For the transition matrix, we linearly increase $\overline{\gamma}_{t}$ and $\overline{\beta}_{t}$ from $0$ to $0.9$ and $0.1$, respectively. We optimize our network using AdamW with $\beta_{1}=0.9$ and $\beta_{2}=0.96$. The learning rate is set to $0.00045$ after 5000 iterations of warmup. More training details are provided in the appendix.

We quantitatively compare the proposed method with several state-of-the-art methods, including some GAN-based methods, DALL-E and CogView, on the MSCOCO, CUB-200 and Oxford-102 datasets. We calculate the FID between $30k$ generated images and $30k$ real images, and show the results in Table 1.

We can see that our small model, VQ-Diffusion-S, which has a similar number of parameters to previous GAN-based models, performs strongly on the CUB-200 and Oxford-102 datasets. Our base model, VQ-Diffusion-B, further improves the performance. Our VQ-Diffusion-F model achieves the best results and surpasses all previous methods by a large margin, even surpassing DALL-E and CogView, which have ten times more parameters than ours, on the MSCOCO dataset.

Some visual comparisons with DM-GAN and DF-GAN are shown in Figure 2. Our synthesized images show more realistic fine-grained details and are more consistent with the input text.

In-the-wild text-to-image synthesis

To demonstrate the capability of generating in-the-wild images, we train our model on three subsets from the LAION-400M dataset, i.e., cartoon, icon and human. We provide our results in Figure 3. Though our base model is much smaller than previous works like DALL-E and CogView, it also achieves strong performance.

Compared with the AR method, which generates images from top-left to bottom-right, our method generates images in a global manner. This allows our method to be applied to many vision tasks, e.g., irregular mask inpainting. For this task, we do not need to re-train a new model. We simply set the tokens in the irregular region to the [MASK] token and send them to our model. This strategy supports both unconditional mask inpainting and text-conditional mask inpainting. We show these results in the appendix.
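The masking step for inpainting can be sketched on a toy token grid as follows. The function name `mask_region`, the $3\times 3$ grid, and the choice of storing [MASK] at the index right after the codebook entries are assumptions for illustration; a real pipeline would operate on the $32\times 32$ token map produced by the image encoder.

```python
def mask_region(tokens, region, mask_id):
    """Return a copy of a 2D token grid with every position flagged in `region` set to mask_id."""
    return [
        [mask_id if flagged else tok for tok, flagged in zip(tok_row, flag_row)]
        for tok_row, flag_row in zip(tokens, region)
    ]


K = 2886     # codebook size quoted in the training details
MASK_ID = K  # assumption: [MASK] is stored as the index right after the codebook
tokens = [[5, 17, 42], [7, 7, 300], [2885, 0, 1]]  # toy 3x3 token grid
region = [[False, True, True],
          [False, False, True],
          [False, False, False]]                   # irregular hole to fill in
masked = mask_region(tokens, region, MASK_ID)
print(masked)
```

The masked grid would then be passed to the denoising network, with or without a text condition, and the network fills in the flagged positions.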

Ablations

Number of timesteps. We investigate the number of timesteps in training and inference, performing the experiment on the CUB-200 dataset as shown in Table 2. When the number of training steps increases from $10$ to $100$, the result improves; when it further increases to $200$, the result appears to saturate. We therefore set the default number of timesteps to $100$ in our experiments. To demonstrate the fast inference strategy, we evaluate the images generated with $10,25,50,100$ inference steps on five models trained with different numbers of steps. We find that the model still maintains good performance when dropping $3/4$ of the inference steps, which may save about $3/4$ of the inference time.

Mask-and-replace diffusion strategy. We explore how the mask-and-replace strategy benefits our performance on the Oxford-102 dataset. We set different final mask rates ($\overline{\gamma}_{T}$) to investigate the effect. Both the mask-only strategy ($\overline{\gamma}_{T}=1$) and the replace-only strategy ($\overline{\gamma}_{T}=0$) are special cases of our mask-and-replace strategy. From Figure 4, we find that the best performance is obtained when $\overline{\gamma}_{T}=0.9$. When $\overline{\gamma}_{T}>0.9$, the model may suffer from the error accumulation problem; when $\overline{\gamma}_{T}<0.9$, the network may have difficulty finding which regions need more attention.

VQ-Diffusion vs VQ-AR. For a fair comparison, we replace our diffusion image decoder with an autoregressive decoder with the same network structure and keep other settings the same, including both the image and text encoders. The autoregressive models are denoted as VQ-AR-S and VQ-AR-B, corresponding to VQ-Diffusion-S and VQ-Diffusion-B. The experiment is performed on the CUB-200 dataset. As shown in Table 3, in both the -S and -B settings the VQ-Diffusion model surpasses the VQ-AR model by a large margin. Meanwhile, we evaluate the throughput of both methods on a V100 GPU with a batch size of 32. The VQ-Diffusion with the fast inference strategy is $15$ times faster than the VQ-AR model and achieves a better FID score.

Unified generation model

Our method is general and can also be applied to other image synthesis tasks, e.g., unconditional image synthesis and image synthesis conditioned on labels. To generate images from a given class label, we first remove the text encoder network and the cross-attention part of the transformer blocks, and inject the class label through the AdaLN operator. Our network contains $24$ transformer blocks with dimension $512$. We train our model on the ImageNet dataset. For the VQ-VAE, we adopt the publicly available VQGAN model trained on the ImageNet dataset, which downsamples images from $256\times 256$ to $16\times 16$. For unconditional image synthesis, we train our model on the FFHQ256 dataset, which contains 70k high-quality face images. The image encoder also downsamples images to $16\times 16$ tokens.

We assess the performance of our model in terms of FID and compare with a variety of previously established models. For a fair comparison, we calculate the FID between $50k$ generated images and all real images. Following prior work, we can further increase the quality by only accepting images with a top $5\%$ classification score, denoted as acc0.05. We show the quantitative results in Table 4. While some task-specialized GAN models report better FID scores, our approach provides a unified model that works well across a wide range of tasks.
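The acceptance step can be sketched as a simple ranking over classifier scores. This is an illustrative reading of the acc0.05 setting, not the authors' evaluation code; the function name `accept_top_fraction`, the rounding rule, and the synthetic scores are assumptions, and in practice the score would come from a pretrained classifier evaluated on the target class.

```python
def accept_top_fraction(scores, fraction=0.05):
    """Return the indices of the samples whose score ranks in the top `fraction`."""
    keep = max(1, round(len(scores) * fraction))  # always keep at least one sample
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


scores = [i / 100 for i in range(100)]  # synthetic classifier scores for 100 samples
accepted = accept_top_fraction(scores, fraction=0.05)
print(accepted)
```

Only the accepted samples would then be used when computing the metric, which trades sampling cost for quality.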

Conclusion

In this paper, we present a novel text-to-image architecture named VQ-Diffusion. The core design is to model the VQ-VAE latent space in a non-autoregressive manner. The proposed mask-and-replace diffusion strategy avoids the accumulation of errors of the AR model. Our model has the capacity to generate more complex scenes, which surpasses previous GAN-based text-to-image methods. Our method is also general and produces strong results on unconditional and conditional image generation.

Acknowledgement

We thank Qiankun Liu from University of Science and Technology of China for his help; he provided the initial code and datasets.

Appendix A Implementation details

In our experiments on text-to-image synthesis, we adopt the public VQ-VAE model provided by VQGAN trained on the OpenImages dataset, which downsamples images from $256\times 256$ to $32\times 32$. We use the CLIP pretrained model (ViT-B) as our text encoder, which encodes a sentence into $77$ tokens. Our diffusion image decoder consists of several transformer blocks, each of which contains full attention, cross attention, and a feed-forward network (FFN). Our base model contains $19$ transformer blocks, and the channel number of each block is $1024$. The FFN contains two linear layers, which expand the dimension to $4096$ in the middle layer. The model contains $370$M parameters. Our small model contains $18$ transformer blocks with a channel number of $192$; its FFN contains two convolution layers with kernel size $3$, and the channel expansion rate is $2$. The model contains $34$M parameters.

For our class-conditional generation model on ImageNet, we adopt the public VQ-VAE model provided by VQGAN trained on ImageNet, which downsamples images from $256\times 256$ to $16\times 16$. Our model contains $24$ transformer blocks, each of which contains a full attention layer and an FFN. The base channel number is $512$. In addition, the FFN uses convolution layers instead of linear layers, and the channel expansion rate is $4$.

Appendix B Proof of Equation 8

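Mathematical induction can be used to prove Equation 8 in the paper. For $t=1$, $\overline{\bm{Q}}_{1}=\bm{Q}_{1}$ with $\overline{\alpha}_{1}=\alpha_{1}$, $\overline{\gamma}_{1}=\gamma_{1}$, and $\overline{\beta}_{1}=\beta_{1}$, so the claim clearly holds. Suppose Equation 8 holds at step $t$; the case $t+1$ then follows by multiplying by $\bm{Q}_{t+1}$ and collecting terms.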
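A sketch of the inductive step, written entry by entry, is given below. It assumes that $\bm{Q}_{t+1}$ keeps a token with probability $\alpha_{t+1}$, resamples it uniformly with probability $K\beta_{t+1}$, masks it with probability $\gamma_{t+1}$, and leaves [MASK] unchanged, and that the vector $\overline{\bm{Q}}_{t}\bm{v}(x_{0})$ has entry $\overline{\alpha}_{t}+\overline{\beta}_{t}$ at $x_{0}$, entry $\overline{\beta}_{t}$ at every other ordinary index, and entry $\overline{\gamma}_{t}$ at the [MASK] index $K+1$.

$$
\begin{aligned}
\big[\bm{Q}_{t+1}\overline{\bm{Q}}_{t}\bm{v}(x_{0})\big]_{x_{0}}
&=(\alpha_{t+1}+\beta_{t+1})(\overline{\alpha}_{t}+\overline{\beta}_{t})+(K-1)\beta_{t+1}\overline{\beta}_{t}\\
&=\overline{\alpha}_{t+1}+\big(\beta_{t+1}\overline{\alpha}_{t}+\alpha_{t+1}\overline{\beta}_{t}+K\beta_{t+1}\overline{\beta}_{t}\big),\\
\big[\bm{Q}_{t+1}\overline{\bm{Q}}_{t}\bm{v}(x_{0})\big]_{x\neq x_{0}}
&=\beta_{t+1}(\overline{\alpha}_{t}+\overline{\beta}_{t})+(\alpha_{t+1}+\beta_{t+1})\overline{\beta}_{t}+(K-2)\beta_{t+1}\overline{\beta}_{t}\\
&=\beta_{t+1}\overline{\alpha}_{t}+\alpha_{t+1}\overline{\beta}_{t}+K\beta_{t+1}\overline{\beta}_{t},\\
\big[\bm{Q}_{t+1}\overline{\bm{Q}}_{t}\bm{v}(x_{0})\big]_{K+1}
&=\gamma_{t+1}(1-\overline{\gamma}_{t})+\overline{\gamma}_{t}
=1-(1-\overline{\gamma}_{t})(1-\gamma_{t+1})=\overline{\gamma}_{t+1}.
\end{aligned}
$$

Because the entries of a probability vector sum to one, the common value shared by the first two lines must equal $(1-\overline{\alpha}_{t+1}-\overline{\gamma}_{t+1})/K=\overline{\beta}_{t+1}$, which gives the claimed form at step $t+1$.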

Appendix C Results

In this part, we provide more visualization results. First, we compare our results with XMC-GAN in Figure 7; their results are taken directly from their paper. The irregular mask inpainting results are shown in Figure 5. We show more of our in-the-wild text-to-image results in Figure 6, and we provide our results on ImageNet and FFHQ in Figure 9 and Figure 8.