NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan

Introduction

Nowadays, the Web is becoming more visual than ever before, as images and videos have become the new information carriers and have been used in many practical applications. With this background, visual synthesis is becoming a more and more popular research topic, which aims to build models that can generate new or manipulate existing visual data (i.e., images and videos) for various visual scenarios.

Auto-regressive models play an important role in visual synthesis tasks, due to their explicit density modeling and stable training advantages compared with GANs. Earlier visual auto-regressive models, such as PixelCNN, PixelRNN, Image Transformer, iGPT, and Video Transformer, performed visual synthesis in a “pixel-by-pixel” manner. However, due to their high computational cost on high-dimensional visual data, such methods can be applied to low-resolution images or videos only and are hard to scale up.

Recently, with the rise of VQ-VAE as a discrete visual tokenization approach, efficient and large-scale pre-training can be applied to visual synthesis tasks for images (e.g., DALL-E and CogView) and videos (e.g., GODIVA). Although achieving great success, such solutions still have limitations: they treat images and videos separately and focus on generating either of them. This prevents the models from benefiting from both image and video data.

In this paper, we present NÜWA, a unified multimodal pre-trained model that aims to support visual synthesis tasks for both images and videos, and conduct experiments on 8 downstream visual synthesis tasks, as shown in Fig. 1. The main contributions of this work are three-fold:

We propose NÜWA, a general 3D transformer encoder-decoder framework, which covers language, image, and video at the same time for different visual synthesis tasks. It consists of an adaptive encoder that takes either text or visual sketch as input, and a decoder shared by 8 visual synthesis tasks.

We propose a 3D Nearby Attention (3DNA) mechanism in the framework to consider the locality characteristic for both spatial and temporal axes. 3DNA not only reduces computational complexity but also improves the visual quality of the generated results.

Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, NÜWA shows surprisingly good zero-shot capabilities not only on text-guided image manipulation, but also on text-guided video manipulation.

Related Works

The method proposed in this paper follows the line of visual synthesis research based on auto-regressive models. Earlier visual auto-regressive models performed visual synthesis in a “pixel-by-pixel” manner. However, due to the high computational cost when modeling high-dimensional data, such methods can be applied to low-resolution images or videos only, and are hard to scale up.

Recently, VQ-VAE-based visual auto-regressive models were proposed for visual synthesis tasks. By converting images into discrete visual tokens, such methods can conduct efficient and large-scale pre-training for text-to-image generation (e.g., DALL-E and CogView), text-to-video generation (e.g., GODIVA), and video prediction (e.g., LVT and VideoGPT), with higher resolution of generated images or videos. However, none of these models was trained on images and videos together, even though it is intuitive that these tasks can benefit from both types of visual data.

Compared to these works, NÜWA is a unified auto-regressive visual synthesis model that is pre-trained on visual data covering both images and videos and can support various downstream tasks. We also verify the effectiveness of different pre-training tasks in Sec. 4.3. Besides, VQ-GAN instead of VQ-VAE is used in NÜWA for visual tokenization, which, based on our experiments, can lead to better generation quality.

2.2 Visual Sparse Self-Attention

How to deal with the quadratic complexity issue brought by self-attention is another challenge, especially for tasks like high-resolution image synthesis or video synthesis.

Similar to NLP, sparse attention mechanisms have been explored to alleviate this issue for visual synthesis. Some works split the visual data into different parts (or blocks) and then performed block-wise sparse attention for the synthesis tasks. However, such methods dealt with different blocks separately and did not model their relationships. Other works proposed to use axial-wise sparse attention in visual synthesis tasks, which conducts sparse attention along the axes of visual data representations. This mechanism makes training very efficient and is friendly to large-scale pre-trained models like DALL-E, CogView, and GODIVA. However, the quality of generated visual contents could be harmed due to the limited contexts used in self-attention. Still other works proposed to use local-wise sparse attention in visual synthesis tasks, which allows the models to see more contexts. But these works were for images only.

Compared to these works, NÜWA proposes a 3D nearby attention that extends the local-wise sparse attention to cover both images and videos. We also verify that local-wise sparse attention is superior to axial-wise sparse attention for visual generation in Sec. 4.3.

Method

where $||I-\hat{I}||_{2}^{2}$ strictly constrains the exact pixel match between $I$ and $\hat{I}$, which limits the generalization ability of the model. Recently, VQ-GAN enhanced VQ-VAE training by adding a perceptual loss and a GAN loss to ease the exact constraints between $I$ and $\hat{I}$ and focus on high-level semantic matching, as denoted in Eq. (4) $\sim$ (5):

3.2 3D Nearby Self-Attention

In this section, we define a unified 3D Nearby Self-Attention (3DNA) module based on the previous 3D data representations, supporting both self-attention and cross-attention. We first give the definition of 3DNA in Eq. (6), and introduce detailed implementation in Eq. (7) $\sim$ (11):

where the $(i,j,k)$ position queries and collects corresponding nearby information in $C$. This also handles the case $C=X$, where $(i,j,k)$ simply queries the nearby positions of itself. 3DNA not only reduces the complexity of full attention from $O\left(\left(hws\right)^{2}\right)$ to $O\left(\left(hws\right)\left(e^{h}e^{w}e^{s}\right)\right)$, but also shows superior performance, which we discuss in Sec. 4.3.
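
To make the locality idea concrete, the following minimal Python sketch enumerates the positions that a query at $(i,j,k)$ could attend to in the self-attention case $C=X$. The function name `nearby_positions`, the window centred on the query, the odd extents, and the clipping at the grid border are all assumptions made for illustration; the exact formulation is the one referred to as Eq. (6) $\sim$ (11), and causal masking of not-yet-generated tokens is not modelled here.

```python
from itertools import product


def nearby_positions(i, j, k, shape, extent):
    """Return the in-bounds grid positions in a window centred on (i, j, k).

    shape  = (h, w, s): height, width and temporal length of the 3D grid.
    extent = (e_h, e_w, e_s): window size along each axis (assumed odd).
    """
    ranges = []
    for centre, size, e in zip((i, j, k), shape, extent):
        half = e // 2
        lo = max(0, centre - half)         # clip at the lower border
        hi = min(size - 1, centre + half)  # clip at the upper border
        ranges.append(range(lo, hi + 1))
    return list(product(*ranges))


# Illustrative grid: 21 x 21 spatial tokens and 10 frames, extents (3, 3, 3).
interior = nearby_positions(10, 10, 5, (21, 21, 10), (3, 3, 3))
corner = nearby_positions(0, 0, 0, (21, 21, 10), (3, 3, 3))
```

Under these assumptions an interior query sees $3\times 3\times 3=27$ positions instead of all $21\times 21\times 10=4410$, and a corner query sees 8 after clipping, which is where the factor $e^{h}e^{w}e^{s}$ in the complexity comes from.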

3.3 3D Encoder-Decoder

Then, the condition $C$ is fed into an encoder with a stack of $L$ 3DNA layers to model the self-attention interactions, with the $l$-th layer denoted in Eq. (14):

Similarly, the decoder is also a stack of $L$ 3DNA layers. The decoder calculates both the self-attention of generated results and the cross-attention between generated results and conditions. The $l$-th layer is denoted in Eq. (15).

where $<i,<j,<k$ denote the tokens generated so far. The initial token $V_{0,0,0}^{(1)}$ is a special $<bos>$ token learned during the training phase.

3.4 Training Objective

We train our model on three tasks: Text-to-Image (T2I), Video Prediction (V2V), and Text-to-Video (T2V). The training objectives for the three tasks are cross-entropy losses, denoted as the three parts of Eq. (16), respectively:

For T2I and T2V tasks, $C^{text}$ denotes text conditions. For the V2V task, since there is no text input, we instead get a constant 3D representation $c$ of the special word “None”. $\theta$ denotes the model parameters.
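
As a rough illustration of a three-part cross-entropy objective, the sketch below sums the negative log-likelihoods of the ground-truth tokens for the three tasks. The function names, the input format (the probability the model assigned to each ground-truth token), and the equal weighting of the three terms are assumptions for illustration only; the actual objective is the one given in Eq. (16).

```python
import math


def cross_entropy(token_probs):
    """Negative log-likelihood of a sequence, given the probability the
    model assigned to each ground-truth token."""
    return -sum(math.log(p) for p in token_probs)


def multitask_loss(t2i_probs, v2v_probs, t2v_probs):
    """Sum of the per-task cross-entropy terms (equal weights assumed)."""
    return (
        cross_entropy(t2i_probs)    # T2I: image tokens given text
        + cross_entropy(v2v_probs)  # V2V: video tokens given the constant "None" condition
        + cross_entropy(t2v_probs)  # T2V: video tokens given text
    )
```

For example, if every task contributes one token predicted with probability 0.5, the loss is $3\ln 2$.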

Experiments

Based on Sec. 3.4 we first pre-train NÜWA on three datasets: Conceptual Captions for text-to-image (T2I) generation, which includes 2.9M text-image pairs, Moments in Time for video prediction (V2V), which includes 727K videos, and VATEX dataset for text-to-video (T2V) generation, which includes 241K text-video pairs. In the following, we first introduce implementation details in Sec. 4.1 and then compare NÜWA with state-of-the-art models in Sec. 4.2, and finally conduct ablation studies in Sec. 4.3 to study the impacts of different parts.

In Sec. 3.1, we set the sizes of 3D representations for text, image, and video as follows. For text, the size of 3D representation is $1\times 1\times 77\times 1280$ . For image, the size of 3D representation is $21\times 21\times 1\times 1280$ . For video, the size of 3D representation is $21\times 21\times 10\times 1280$ , where we sample 10 frames from a video with 2.5 fps. Although the default visual resolution is $336\times 336$ , we pre-train different resolutions for a fair comparison with existing models. For the VQ-GAN model used for both images and videos, the size of grid feature $E(I)$ in Eq. (1) is $441\times 256$ , and the size of the codebook $B$ is $12,288$ .
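
The sizes above translate directly into sequence lengths. The short sketch below, with the hypothetical names `SIZES` and `num_tokens`, multiplies out the first three dimensions of each representation; the last dimension (1280) is the feature width and does not affect the token count.

```python
# (h, w, s, d) sizes of the 3D representations quoted in the text.
SIZES = {
    "text": (1, 1, 77, 1280),
    "image": (21, 21, 1, 1280),
    "video": (21, 21, 10, 1280),
}


def num_tokens(size):
    """Number of tokens in a 3D representation of size (h, w, s, d)."""
    h, w, s, _d = size
    return h * w * s


token_counts = {name: num_tokens(size) for name, size in SIZES.items()}
```

The image count of 441 matches the first dimension of the $441\times 256$ grid feature, and a 10-frame video is ten times as long as a single image.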

Different sparse extents are used for different modalities in Sec. 3.2. For text, we set $(e^{w},e^{h},e^{s})=(1,1,\infty)$ , where $\infty$ denotes that the full text is always used in attention. For image and image sketches, $(e^{w},e^{h},e^{s})=(3,3,1)$ . For video and video sketches, $(e^{w},e^{h},e^{s})=(3,3,3)$ .

We pre-train on 64 A100 GPUs for two weeks with the number of layers $L$ in Eq. (14) set to 24, an Adam optimizer with a learning rate of 1e-3, a batch size of 128, and a warm-up over 5% of the total 50M steps. The final pre-trained model has a total of 870M parameters.
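
The warm-up setting can be made concrete with a small schedule sketch. Only the peak learning rate, the 5% warm-up fraction, and the 50M total steps come from the text; the linear shape of the warm-up and the constant rate afterwards are assumptions, since the text does not specify them.

```python
TOTAL_STEPS = 50_000_000
WARMUP_PERCENT = 5
PEAK_LR = 1e-3

# 5% of 50M steps, computed with integer arithmetic.
WARMUP_STEPS = TOTAL_STEPS * WARMUP_PERCENT // 100


def learning_rate(step):
    """Linear warm-up to PEAK_LR (assumed), then constant (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR
```

With these numbers the warm-up phase covers the first 2.5M steps.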

4.2 Comparison with state-of-the-art

Text-to-Image (T2I) fine-tuning: We compare NÜWA on the MSCOCO dataset quantitatively in Tab. 1 and qualitatively in Fig. 3. Following DALL-E, we use the $k$-blurred FID score (FID-$k$) and Inception Score (IS) to evaluate the quality and variety respectively, and following GODIVA, we use the CLIPSIM metric, which incorporates a CLIP model to calculate the semantic similarity between the input text and the generated image. For a fair comparison, all the models use the resolution of $256\times 256$. We generate 60 images for each text and select the best one by CLIP. In Tab. 1, NÜWA significantly outperforms CogView with an FID-0 of 12.9 and a CLIPSIM of 0.3429. Although XMC-GAN reports a lower FID score of 9.3, we find that NÜWA generates more realistic images on the exact same samples shown in XMC-GAN’s paper (see Fig. 3). Especially in the last example, the boy’s face is clear and the balloons are correctly generated.

Text-to-Video (T2V) fine-tuning: We compare NÜWA on the Kinetics dataset quantitatively in Tab. 2 and qualitatively in Fig. 4. Following TFGAN, we evaluate the visual quality on the FID-img and FID-vid metrics and the semantic consistency on the accuracy of the label of the generated video. As shown in Tab. 2, NÜWA achieves the best performance on all the above metrics. In Fig. 4, we also show the strong zero-shot ability for generating videos from unseen text, such as “playing golf at swimming pool” or “running on the sea”.
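
The "generate 60 images and select the best one by CLIP" step is a reranking procedure, which can be sketched as follows. The sketch assumes that the text and each candidate image have already been mapped to embedding vectors by some CLIP-like model and that candidates are ranked by cosine similarity; both the embeddings and the function names are hypothetical, and no real CLIP API is called.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def select_best(text_embedding, image_embeddings):
    """Return (index, score) of the candidate most similar to the text."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy 2-D embeddings standing in for 3 of the generated candidates.
best_index, best_score = select_best([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
```

In this toy example the third candidate is selected because its embedding points in the same direction as the text embedding.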

Video Prediction (V2V) fine-tuning: We compare NÜWA on BAIR Robot Pushing dataset quantitatively in Tab. 3. Cond. denotes the number of frames given to predict future frames. For a fair comparison, all the models use 64×64 resolutions. Although given only one frame as condition (Cond.), NÜWA still significantly pushes the state-of-the-art FVD score from 94±2 to 86.9.

Sketch-to-Image (S2I) fine-tuning: We compare NÜWA on MSCOCO stuff qualitatively in Fig. 5. NÜWA generates a great variety of realistic buses compared with Taming-Transformers and SPADE. Even the reflection in the bus window is clearly visible.

Image Completion (I2I) zero-shot evaluation: We compare NÜWA in a zero-shot manner qualitatively in Fig. 6. Given the top half of the tower, compared with Taming Transformers, NÜWA shows a richer imagination of what the lower half of the tower could look like, including buildings, lakes, flowers, grass, trees, mountains, etc.

Text-Guided Image Manipulation (TI2I) zero-shot evaluation: We compare NÜWA in a zero-shot manner qualitatively in Fig. 7. Compared with Paint By Word, NÜWA shows strong manipulation ability, generating high-quality text-consistent results while not changing other parts of the image. For example, in the third row, the blue firetruck generated by NÜWA is more realistic, while the buildings behind it show no change. This benefits from the real-world visual patterns learned by multi-task pre-training on various visual tasks. Another advantage is the inference speed of NÜWA: it takes about 50 seconds to generate an image, while Paint By Word requires additional training during inference and takes about 300 seconds to converge.

Sketch-to-Video (S2V) fine-tuning and Text-Guided Video Manipulation (TV2V) zero-shot evaluation: As far as we know, open-domain S2V and TV2V are tasks first proposed in this paper. Since there are no existing baselines to compare against, we instead discuss them in the ablation study in Sec. 4.3.

More detailed comparisons and samples, including human evaluations, are provided in the appendix.

4.3 Ablation Study

The upper part of Tab. 4 shows the effectiveness of different VQ-VAE (VQ-GAN) settings. We experiment on ImageNet and Open Images. $R$ denotes the raw resolution and $D$ denotes the number of discrete tokens. The compression rate is denoted as $Fx$, where $x$ is the quotient of $\sqrt{R}$ divided by $\sqrt{D}$. Comparing the first two rows in Tab. 4, VQ-GAN shows significantly better Fréchet Inception Distance (FID) and Structural Similarity Index Measure (SSIM) scores than VQ-VAE. Comparing Rows 2 and 3, we find that the number of discrete tokens, rather than the compression rate, is the key factor leading to higher visual quality. Although Row 2 and Row 4 have the same compression rate F16, they have different FID scores of 6.04 and 4.79. So what matters is not only how much we compress the original image, but also how many discrete tokens are used to represent it. This is intuitive: it is too ambiguous to represent a human face with just one token. In practice, we find that $16^{2}$ discrete tokens usually lead to poor performance, especially for human faces, and $32^{2}$ tokens show the best performance. However, more discrete tokens mean more computation, especially for videos. We finally use a trade-off version for our pre-training: $21^{2}$ tokens. By training on the Open Images dataset, we further improve the FID score of the $21^{2}$ version from 4.79 to 4.31.
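
The definition of the compression rate can be checked with a few lines of Python. The helper name `compression_rate` is hypothetical; the two example settings ($336\times 336$ pixels with $21^{2}$ tokens, and $256\times 256$ pixels with $32^{2}$ tokens) are the resolutions and token counts quoted elsewhere in the paper.

```python
import math


def compression_rate(raw_pixels, num_tokens):
    """Return x in F_x, i.e. sqrt(R) / sqrt(D)."""
    return math.sqrt(raw_pixels) / math.sqrt(num_tokens)


rate_336 = compression_rate(336 ** 2, 21 ** 2)  # default 336 x 336 setting
rate_256 = compression_rate(256 ** 2, 32 ** 2)  # 256 x 256 setting
```

The first setting gives F16 and the second F8, so the $21^{2}$-token version compresses each side by a factor of 16 while still using more tokens than a $16^{2}$ grid.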

The lower part of Tab. 4 shows the performance of VQ-GAN for sketches. VQ-GAN-Seg on MSCOCO is trained for the Sketch-to-Image (S2I) task and VQ-GAN-Seg on VSPW is trained for the Sketch-to-Video (S2V) task. All the above backbones show good performance in Pixel Accuracy (PA) and Frequency Weighted Intersection over Union (FWIoU), which indicates the good quality of the 3D sketch representation used in our model. Fig. 8 also shows some reconstructed samples of 336×336 images and sketches.

Tab. 5 shows the effectiveness of multi-task pre-training for the Text-to-Video (T2V) generation task. We study on a challenging dataset, MSR-VTT, with natural descriptions and real-world videos. Compared with training only on a single T2V task (Row 1), training on both T2V and T2I (Row 2) improves the CLIPSIM from 0.2314 to 0.2379. This is because T2I helps to build a connection between text and image, and is thus helpful for the semantic consistency of the T2V task. In contrast, training on both T2V and V2V (Row 3) improves the FVD score from 52.98 to 51.81. This is because V2V helps to learn a common unconditional video pattern, and is thus helpful for the visual quality of the T2V task. As a default setting of NÜWA, training on all three tasks achieves the best performance.

Tab. 7 shows the effectiveness of 3D nearby attention for the Sketch-to-Video (S2V) task on the VSPW dataset. We study the S2V task because both the encoder and decoder of this task are fed with 3D video data. To evaluate the semantic consistency for S2V, we propose a new metric called Detected PA, which uses a semantic segmentation model to segment each frame of the generated video and then calculates the pixel accuracy between the generated segments and the input video sketch. The default NÜWA setting in the last row, with both nearby encoder and nearby decoder, achieves the best FID-vid and Detected PA. The performance drops if either the encoder or the decoder is replaced by full attention, showing that focusing on nearby conditions and nearby generated results is better than simply considering all the information. We compare nearby-sparse and axial-sparse attention in two respects. Firstly, the computational complexity of nearby-sparse attention is $O\left(\left(hws\right)\left(e^{h}e^{w}e^{s}\right)\right)$ and that of axial-sparse attention is $O\left(\left(hws\right)\left(h+w+s\right)\right)$. For generating long videos (larger $s$), nearby-sparse attention will be more computationally efficient. Secondly, nearby-sparse attention performs better than axial-sparse attention in visual generation tasks, because nearby-sparse attention attends to “nearby” locations containing interactions between both spatial and temporal axes, while axial-sparse attention handles different axes separately and only considers interactions on the same axis.

Fig. 9 shows a new task proposed in this paper, which we call “Text-Guided Video Manipulation (TV2V)”. TV2V aims to change the future of a video starting from a selected frame guided by text. All samples start to change the future of the video from the second frame. The first row shows the original video frames, where a diver is swimming in the water. After feeding “The diver is swimming to the surface” into NÜWA’s encoder and providing the first video frame, NÜWA successfully generates a video with the diver swimming to the surface in the second row. The third row shows another successful sample that lets the diver swim to the bottom. What if we want the diver to fly to the sky? The fourth row shows that NÜWA can do that as well, with the diver flying upward like a rocket.
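
The Detected PA metric can be sketched as a per-frame pixel accuracy. The code below assumes that a segmentation model (not shown) has already produced one label map per generated frame, that label maps are nested lists of integer class ids, and that per-frame accuracies are averaged over frames; the averaging choice and the function names are assumptions, not details given in the text.

```python
def pixel_accuracy(predicted, target):
    """Fraction of pixels whose class label matches between two label maps."""
    total = 0
    correct = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            correct += int(p == t)
    return correct / total


def detected_pa(predicted_frames, sketch_frames):
    """Mean per-frame pixel accuracy between the segmented generated frames
    and the input video sketch (frame averaging is an assumption)."""
    scores = [pixel_accuracy(p, t) for p, t in zip(predicted_frames, sketch_frames)]
    return sum(scores) / len(scores)


# Two 2 x 2 frames: the first has one wrong pixel, the second is exact.
segmented = [[[1, 1], [0, 0]], [[2, 2], [0, 0]]]
sketch = [[[1, 0], [0, 0]], [[2, 2], [0, 0]]]
score = detected_pa(segmented, sketch)
```

Here the per-frame accuracies are 0.75 and 1.0, giving a score of 0.875.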

Conclusion

In this paper, we present NÜWA as a unified pre-trained model that can generate new or manipulate existing images and videos for 8 visual synthesis tasks. Several contributions are made here, including (1) a general 3D encoder-decoder framework covering texts, images, and videos at the same time; (2) a nearby-sparse attention mechanism that considers the nearby characteristic of both spatial and temporal axes; (3) comprehensive experiments on 8 synthesis tasks. This is our first step towards building an AI platform to enable visual world creation and help content creators.

Appendix A Comparisons between 3D Sparse Attentions

Fig. 10 shows comparisons between different 3D sparse attentions. Assume we have 3D data with the size of $4\times 4\times 2$. The idea of 3D block-sparse attention is to split the 3D data into several fixed blocks and handle these blocks separately. There are many ways to split blocks, such as splitting in time, space, or both. The 3D block-sparse example in Fig. 10 considers the split of both time and space. The 3D data is divided into 4 parts, each with the size of $2\times 2\times 2$. To generate the orange token, 3D block-sparse attention considers previous tokens inside the fixed 3D block. Although 3D block-sparse attention considers both spatial and temporal axes, this spatial and temporal information is limited and fixed in the 3D block, especially for the tokens along the edge of the 3D block. Only part of the nearby information is considered, since some nearby information outside the 3D block is invisible to tokens inside it. The idea of 3D axial-sparse attention is to consider previous tokens along the axis. Although 3D axial-sparse attention considers both spatial and temporal axes, this spatial and temporal information is limited to the axes. Only part of the nearby information is considered, and nearby information that does not lie on an axis is ignored. In this paper, we propose a 3D nearby-sparse attention, which considers the full nearby information and dynamically generates the 3D nearby attention block for each token. The attention matrix also provides evidence for this, as the attended part (blue) for 3D nearby-sparse is smoother than for 3D block-sparse and 3D axial-sparse.

Tab. 7 shows the complexity of different 3D sparse attentions. $h,w,s$ denote the spatial height, spatial width, and temporal length of the 3D data. Different sparse mechanisms have their computational advantages in different scenarios. For example, for long videos or high-resolution frames with large $h,w,s$, usually $\left(e^{h}e^{w}e^{s}\right)<(h+w+s)$, and 3D nearby-sparse attention is more efficient than 3D axial-sparse attention. If the 3D data can be split into several parts without dependencies, 3D block-sparse will be a good choice. For example, for a cartoon with several episodes that each tell a separate story, we can simply split these stories as they share no relationship.
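
The condition $\left(e^{h}e^{w}e^{s}\right)<(h+w+s)$ can be checked numerically. The sketch below counts the keys attended to per query for nearby-sparse and axial-sparse attention using the complexity expressions from the text; the function names are hypothetical, constant factors and border effects are ignored, and the example grid ($21\times 21\times 10$ with extents $(3,3,3)$) is the video setting used in the experiments.

```python
def keys_per_query_nearby(extent):
    """Keys attended to per query under nearby-sparse attention."""
    e_h, e_w, e_s = extent
    return e_h * e_w * e_s


def keys_per_query_axial(shape):
    """Keys attended to per query under axial-sparse attention."""
    h, w, s = shape
    return h + w + s


def total_cost(shape, keys_per_query):
    """Number of query-key pairs over the whole 3D grid."""
    h, w, s = shape
    return h * w * s * keys_per_query


shape = (21, 21, 10)
nearby_cost = total_cost(shape, keys_per_query_nearby((3, 3, 3)))
axial_cost = total_cost(shape, keys_per_query_axial(shape))
```

For this grid, nearby-sparse attention uses 27 keys per query against 52 for axial-sparse attention, and the gap widens as $h$, $w$, or $s$ grows while the extents stay fixed.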

Appendix B Details of Multi-task Pre-training

Tab. 8 shows the implementation details of the two NÜWA settings used in this paper. Both the NÜWA-256 and NÜWA-336 models are trade-offs between image quality and video length (number of video frames). Image quality relies heavily on the compression ratio and the number of discrete tokens: a low compression ratio and a large number of discrete tokens are key factors for high-quality images. However, as the total capacity of the model is limited, the number of discrete tokens per image and the number of video frames (images) must be balanced against each other.

Note that NÜWA-256 adopts a compression ratio of F8 with 32×32 discrete tokens, while NÜWA-336 adopts a compression ratio of F16 with only 21×21 discrete tokens. To make a fair comparison with current state-of-the-art models, we adopt NÜWA-256 with more discrete tokens to generate high-quality images. However, NÜWA-256 can only generate videos with 4 frames considering the efficiency of the transformer. To handle relatively long videos, NÜWA-336 with fewer discrete tokens can generate videos with 10 frames. As a result, NÜWA-336 significantly relieves the pressure on the auto-regressive model in the second stage, especially for videos. NÜWA-336 is the default setting to cover both images and videos.

For both models, note that we did not extensively tune the hyperparameters and simply use the same learning rate of $10^{-3}$ and 50M training steps.

Appendix C Human Evaluation

Fig. 11 presents human comparison results between CogView and our NÜWA on the MSCOCO dataset for the Text-to-Image (T2I) task. We randomly selected 2000 texts and asked annotators to compare the generated results of the two models in terms of both visual quality and semantic consistency. The annotators were asked to choose among three options: better, worse, or undetermined. In the visual quality part, there are 62% votes for our NÜWA model, 15% undetermined, and 23% votes for CogView, which shows that NÜWA generates more realistic images. In the semantic consistency part, although 67% of the votes cannot determine which model is more consistent with the text, NÜWA still wins 21% of the votes. Although CogView is pre-trained on more text-image pairs than NÜWA, our model still benefits from multi-task pre-training, as text-video pairs provide high-level semantic information for text-to-image generation.

Fig. 12 shows human comparison results between VQ-GAN and our NÜWA model on the MSCOCO dataset for the Image Completion (I2I) task. We use settings similar to those of Fig. 11, but remove semantic consistency as there is no text input for this task. The comparison results show that there are 89% votes for NÜWA, which shows the strong zero-shot ability of NÜWA.