OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang

Introduction

The rise of generative models has been one of the most exciting developments in artificial intelligence, offering the potential to revolutionize the way we generate visual content. In recent years, two dominant paradigms for visual generation have emerged: language model-based methods and diffusion models. The former exploits the superior sequence modeling capability of language models (LMs) for visual generation by formulating it as a next-token prediction process, while the latter gradually transforms noise into coherent visual structures through a carefully crafted reverse diffusion process.

Core to both approaches is the tokenizer, which translates visual signals into latent representations. LM tokenizers, also known as VQVAE, discretize inputs into sequences of latent codes, while diffusion tokenizers, i.e., VAE, model their probability distributions within a latent space. Analogous to the role of the lexicon in a written language, tokenizers for visual synthesis dictate the upper bound of the generative models, and have thus attracted increasing attention in the community.

Existing tokenizers are designed specifically for either image or video inputs, which inherently limits their application flexibility and data scalability for the downstream generative models. Although MAGVITv2 has explored causal 3D convolution to process both modalities, it still has to train separate models for image and video data, without achieving synergy between them. This work highlights the critical need for a joint image-video tokenizer, with two primary considerations. First, a joint image-video tokenizer enables joint learning from image and video data, which mitigates the scarcity of data in a single modality (particularly video data) and helps the tokenizer learn more general representations. Second, a unified tokenization framework inherently enjoys better versatility and scalability. For instance, its performance can be improved by incorporating data from either modality for training. This further promotes the efficacy of generative models tailored to image or video generation.

With this in mind, we present OmniTokenizer, a transformer-based tokenizer for joint image-video tokenization. Intuitive as it may seem, simply unifying image and video data does not by itself produce reciprocal benefits between the two modalities. To address this challenge, we turn to a spatial-temporal decoupled architecture, where window attention is employed in the spatial dimension owing to its local aggregation capacity and efficiency, and causal attention is used in the temporal dimension to capture the motion in videos and ensure temporal coherence. Complementing the model design, we introduce a progressive training strategy that begins with image pretraining at a fixed resolution to establish a fundamental understanding of static visual information. After this, we integrate video data for joint training at variable resolutions to capture the dynamics in more complex scenes. The progressive training strategy allows our method to bridge the gap between disparate forms of visual input and capitalize on the rich spectrum of visual data.

To empirically validate the effectiveness of the proposed method, we separately implement the LM and diffusion tokenizers, i.e., OmniTokenizer-VQVAE and OmniTokenizer-VAE, and conduct experiments on a wide range of datasets including ImageNet, CelebA-HQ, FFHQ, UCF-101, Kinetics-600, etc. The results demonstrate that our model outperforms existing methods in reconstruction quality on both image datasets (e.g., 1.11 rFID for OmniTokenizer-VQVAE and 0.69 rFID for OmniTokenizer-VAE on ImageNet) and video datasets (e.g., 42 rFVD for OmniTokenizer-VQVAE and 23 rFVD for OmniTokenizer-VAE on UCF-101). In addition, using our approach for tokenization, we show that both language model-based generative models and diffusion models achieve competitive results on class-conditional generation, unconditional generation, and frame prediction tasks.

In summary, our work makes the following key contributions:

We introduce OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. For the first time, OmniTokenizer employs a shared framework and shared weights to handle both types of visual data.

We propose a progressive training strategy that begins with image pre-training at a fixed resolution and then transitions to image-video joint training at multiple resolutions. Such an approach capitalizes on the synergies between image and video data, enabling OmniTokenizer to achieve better performance than image-only or video-only training.

We conduct extensive experiments across various datasets, including ImageNet, CelebA-HQ, FFHQ, UCF-101, and Kinetics-600. The results showcase the state-of-the-art reconstruction performance of OmniTokenizer on both image and video datasets. Furthermore, equipped with OmniTokenizer, both language model-based generative models and diffusion models achieve superior generation results.

Related Work

Language models have emerged as powerful contenders in the visual generation field, drawing inspiration from their unparalleled success in natural language processing and visual understanding. These methods recast visual synthesis as a sequence prediction problem, similar to constructing sentences in human language.

Depending on whether the tokens are predicted sequentially or in parallel, LM-based methods can be further categorized into autoregressive models and non-autoregressive models. Autoregressive (AR) models were the initial foray into visual generation, utilizing the inherent sequential nature of language models to generate images and videos in a step-wise fashion. These models, such as DALL-E and its preceding variants, typically work by predicting one token at a time and are characterized by their high-quality outputs and precise control over the generation process. VAR redefines the autoregressive learning framework on images as a coarse-to-fine "next-scale prediction" paradigm. Non-autoregressive (Non-AR) models, on the other hand, have been developed to allow for a faster generation process by predicting multiple tokens independently and in parallel. Models like MaskGIT leverage this parallelism to significantly reduce generation time while maintaining high fidelity in synthesized images. Non-AR approaches have also demonstrated promise in video generation, as exemplified by the MAGVIT series. Both AR and non-AR methods have significantly advanced the field of visual generation, offering novel methods to synthesize high-quality images and videos.

2.2 Diffusion Models for Visual Generation

Diffusion models represent an alternative avenue for visual generation, relying on a probabilistic process that iteratively denoises a random signal into structured images or videos. These models stand out for their flexibility in generating visual outputs that not only exhibit coherent global structures but are also rich with intricate textures. Unlike language models that discretize visual inputs as latent codes, diffusion models directly generate visual samples in continuous pixel space. While effective, this approach demands significant computational resources given the high dimensionality of visual data.

The advent of latent diffusion models (LDMs) seeks to mitigate these issues by compressing the high-dimensional visual data into a latent space with a pretrained Variational Autoencoder (VAE). LDMs preserve the desirable properties of pixel-space diffusion models, such as high-quality image synthesis and the ability to incorporate conditional information, while drastically reducing the training and sampling overhead. Since then, LDMs have continued to push visual generation toward higher quality, larger resolution, and more complex scenes.

Methodology

We aim to enable image and video tokenization in a unified framework and achieve mutual benefits between them. To accomplish this, we employ a transformer-based architecture with decoupled spatial and temporal blocks (Sec. 3.1.1). Complementing this, we also propose a progressive training strategy consisting of two consecutive stages to learn the visual encoding in an incremental way (Sec. 3.1.2). The overall framework of our method is illustrated in Figure 1.

Encoder and Decoder. For better compatibility with image and video inputs, we adopt a spatial-temporal factorized encoder consisting of separate spatial and temporal blocks. In the spatial dimension, window attention is employed, as it exhibits superior local aggregation capability and efficiency. In the temporal dimension, we use causal attention to align with the autoregressive visual generation in the second stage. The latent code $z$ is then obtained by looking up a codebook for the LM tokenizer (i.e., quantization in VQVAE), or by sampling from a Gaussian distribution for the diffusion tokenizer.
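The two attention patterns described above can be illustrated with a minimal pure-Python sketch. It only builds the index structures (which tokens share a spatial window, and which frames a given frame may attend to); the function names `window_partition` and `causal_mask` and the toy grid sizes are assumptions made for illustration and are not taken from the authors' implementation.

```python
def window_partition(height, width, window):
    """Group the token positions of a height x width grid into non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([(r, c)
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def causal_mask(num_frames):
    """mask[t][s] is True when frame t may attend to frame s, i.e. when s <= t."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


# Spatial attention: each token attends only to tokens inside its own window.
windows = window_partition(height=4, width=4, window=2)

# Temporal attention: each frame attends to itself and to earlier frames only.
mask = causal_mask(num_frames=3)
```

In this toy setting a 4 x 4 token grid is split into four 2 x 2 windows, and the temporal mask is lower triangular, so the first frame depends on no other frame and can be processed like a standalone image.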

The architecture of the decoder is symmetric with the encoder. Finally, we map the spatial-temporal tokens to the pixel space with two linear projection layers without any activation function.

3.1.2 Progressive Training

Unlike existing image tokenizers that are trained on image data only, or video tokenizers that use their image counterparts as initialization, we leverage a progressive training paradigm that involves two consecutive stages of VQ training to facilitate spatial-temporal representation learning of our LM tokenizer, OmniTokenizer-VQVAE. After this, it can be fine-tuned as a diffusion tokenizer, OmniTokenizer-VAE, with KL fine-tuning.

Two-stage VQ Training, as depicted in Figure 2, aims to learn visual reconstruction with discrete latent codes. It includes two stages. The initial stage focuses on fixed-resolution image data to lay a foundation for spatial understanding. Building upon this, the second stage introduces video data to learn the modeling of temporal dynamics alongside static image features. This image-video joint training stage is critical for the model to learn a universal embedding that accurately captures both the spatial intricacies of individual frames and the temporal relationships of sequential video data.

During both stages, the model is trained with a vector-quantization objective $\mathcal{L}_{VQ}$:

KL fine-tuning. After the VQ training, we further fine-tune our model as a diffusion tokenizer (i.e., OmniTokenizer-VAE) by replacing the above $\mathcal{L}_{VQ}$ with Kullback-Leibler (KL) loss:

where $P(z)$ is a Gaussian distribution and $Q(z|x)$ represents the inferred posterior over the latent code given the observed input.
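As a hedged illustration of how such a KL term is commonly evaluated, the sketch below uses the closed form for a diagonal Gaussian posterior against a standard normal prior. Both the diagonal parameterization (mean `mu`, log-variance `logvar`) and the choice of a zero-mean, unit-variance prior are assumptions; the text only states that $P(z)$ is Gaussian.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))


# A posterior that already equals the prior incurs no penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the mean of one unit-variance dimension by 1 costs 0.5 nats.
kl_shift = kl_to_standard_normal([1.0], [0.0])
```

Under these assumptions the term is zero exactly when the posterior matches the prior and grows as the encoder's mean or variance drifts away from it.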

Besides $\mathcal{L}_{VQ}$ or $\mathcal{L}_{KL}$, both VQ training and KL fine-tuning also employ an $L_{2}$ reconstruction loss $\mathcal{L}_{recon}$ and a GAN loss $\mathcal{L}_{GAN}$.
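For readers unfamiliar with vector quantization, the discrete branch can be sketched as a nearest-neighbor codebook lookup followed by the two penalty terms of the standard VQ-VAE formulation (a codebook term and a commitment term). This is the generic VQ-VAE recipe, assumed here for illustration; it is not confirmed to be the exact form of $\mathcal{L}_{VQ}$ used by the authors, and the names `quantize`, `vq_loss_value`, `lambda_codebook`, and `lambda_commit` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e, codebook):
    """Return (index, code vector) of the codebook entry nearest to z_e."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]


def vq_loss_value(z_e, z_q, lambda_codebook=1.0, lambda_commit=1.0):
    """Value of the standard VQ-VAE penalty (assumed form, hypothetical weights).

    Both terms equal the same squared distance numerically; during training
    they differ only in which side is held fixed by a stop-gradient.
    """
    distance = squared_distance(z_e, z_q)
    return lambda_codebook * distance + lambda_commit * distance


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # toy 3-entry codebook
encoder_output = [0.9, 0.8]                      # toy encoder feature
index, z_q = quantize(encoder_output, codebook)
loss = vq_loss_value(encoder_output, z_q)
```

Here the encoder feature is closest to the second codebook entry, so `index` is 1 and that entry is passed on to the decoder as the latent code.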

3.2 Visual Generation

As mentioned in Sec. 3.1.2, after the progressive training and KL fine-tuning, we obtain two tokenizers, OmniTokenizer-VQVAE and OmniTokenizer-VAE, which encode the visual inputs into latent codes in a discrete codebook and into a continuous latent space, respectively. With these, we further train language models or diffusion models for visual generation.

Language model-based generation approaches formulate visual synthesis as a token prediction problem. Specifically, after OmniTokenizer-VQVAE tokenizes image or video inputs into a sequence of discrete latent codes, we first flatten them in raster order to obtain the code indices $y$. Then a transformer language model is trained to maximize the log-likelihood of the target tokens $y$ under the predicted distribution $\hat{y}$, i.e., to minimize the cross-entropy loss:
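The flattening and the training objective can be sketched in a few lines of pure Python. The sketch assumes the model emits one vector of unnormalized logits per position; the helper names `raster_flatten` and `cross_entropy` and the toy values are illustrative and do not come from the paper.

```python
import math


def raster_flatten(grid):
    """Flatten a 2D grid of code indices row by row (raster order)."""
    return [index for row in grid for index in row]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    total = 0.0
    for step_logits, target in zip(logits, targets):
        peak = max(step_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in step_logits))
        total += log_norm - step_logits[target]
    return total / len(targets)


# A toy 2 x 2 grid of indices into a 4-entry codebook.
tokens = raster_flatten([[3, 1], [0, 2]])

# Uniform logits give a loss of log(codebook size) at every position.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in tokens]
loss = cross_entropy(uniform_logits, tokens)
```

For video inputs the same idea applies with one grid per latent frame, concatenated in temporal order before being fed to the language model.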

Latent diffusion models (LDMs) perform the diffusion process in the latent space to enable high-quality image synthesis with improved computational efficiency. Specifically, given the 2D latent representation from OmniTokenizer-VAE, the diffusion process gradually applies Gaussian noise to the latent code to generate a perturbed sample, while the denoising process trains a diffusion model to predict the noise that has been added. During inference, the trained diffusion model generates a coherent visual sample from noise by iteratively reversing the noising process.
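The noising step and the noise-prediction objective described above can be sketched as follows. The sketch follows the common DDPM-style formulation with a linear beta schedule; the schedule endpoints (`1e-4`, `0.02`), the number of steps, and all function names are assumptions for illustration, since the text does not specify them.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    z_t = [scale * z + sigma * e for z, e in zip(z0, eps)]
    return z_t, eps


def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(num_steps=1000)
latent = [0.5, -1.0, 2.0]                          # toy latent code
noisy, eps = add_noise(latent, alpha_bars[500], rng)
perfect = noise_prediction_loss(eps, eps)          # an ideal denoiser scores 0
```

Because `alpha_bars` decreases monotonically, later timesteps retain less of the original latent and more noise, which is what the denoiser learns to undo.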

Experiments

Datasets. We evaluate the visual tokenization performance of OmniTokenizer on both image and video datasets, including ImageNet, CelebA-HQ, FFHQ, Kinetics, UCF-101, Moments-in-Time (MiT), and Something-Something v2 (SSV2). We adopt a subset of the above datasets for visual generation to compare with previous works.

Implementation Details. OmniTokenizer adopts a decoupled spatial-temporal architecture consisting of 4 window attention-based spatial layers (window size = 8) and 4 causal attention-based temporal layers. The hidden dimension is 512 and the latent dimension is 8, following ViT-VQGAN. $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are set to 1, 1, and 1e-6, respectively. As mentioned in Sec. 3.1.2, the training of OmniTokenizer follows a progressive training strategy, where each stage lasts 500K iterations. The learning rate is warmed up to 1e-3 and decayed to 0 using a cosine scheduler. Adam is employed for optimization ($\beta_{1} = 0.9$ and $\beta_{2} = 0.99$). During the image training stage, we train the model at a fixed image resolution of $256 \times 256$. For the joint training stage, we feed the model image and video data alternately, with a video sequence length of 17 frames. The spatial resolutions are randomly chosen from 128, 192, 256, 320, and 384. Only random horizontal flip is adopted for data augmentation. We train our model using 8 NVIDIA A100 GPUs for 2 weeks. Unless otherwise stated, the results reported in this paper are obtained with joint training on ImageNet and UCF-101.
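The learning-rate schedule and the multi-resolution sampling described above can be sketched as follows. The peak rate, the number of iterations per stage, and the resolution set come from the text; the warmup length (`WARMUP_STEPS`) and the linear shape of the warmup are not stated and are hypothetical choices made only so that the sketch runs.

```python
import math
import random

PEAK_LR = 1e-3                           # from the text
TOTAL_STEPS = 500_000                    # iterations per stage, from the text
WARMUP_STEPS = 5_000                     # hypothetical: not stated in the text
RESOLUTIONS = (128, 192, 256, 320, 384)  # from the text


def learning_rate(step, warmup_steps=WARMUP_STEPS,
                  total_steps=TOTAL_STEPS, peak=PEAK_LR):
    """Linear warmup (assumed shape) to the peak rate, then cosine decay to 0."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(progress, 1.0)))


def sample_resolution(rng):
    """Pick the spatial resolution for one joint-training batch at random."""
    return rng.choice(RESOLUTIONS)


rng = random.Random(0)
batch_resolutions = [sample_resolution(rng) for _ in range(4)]
```

Drawing one resolution per batch, rather than per sample, keeps every tensor in a batch the same shape; whether the authors sample per batch or per sample is not specified.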

We use both language models and diffusion models for visual generation with OmniTokenizer as the tokenizer. The configuration of the language model follows VQGAN, and for a fair comparison with previous methods, we also scale up the model size by increasing the hidden dimension to 1535, following ViT-VQGAN. The training of image and video diffusion transformers follows DiT and Latte, respectively.

We first evaluate the visual tokenization capability of OmniTokenizer on ImageNet and two high-quality face datasets, CelebA-HQ and FFHQ. Following previous methods, we use reconstruction FID as the metric. We can observe from Table 2 that, with the same compression rate and codebook size, OmniTokenizer outperforms existing methods by a large margin on all these datasets. In particular, OmniTokenizer-VQVAE achieves 1.11 FID on ImageNet, beating ViT-VQGAN, the previous state-of-the-art method, by 13%. When fine-tuned as OmniTokenizer-VAE, the FID is further reduced to 0.69. We hypothesize that this improvement arises because KL training provides smoother gradients than VQ training and avoids the loss of information in the quantization process.

In addition, we also conduct video reconstruction experiments and report the results in Table 2. We can see that on both UCF-101 and Moments-in-Time datasets, OmniTokenizer achieves the best results. The video reconstruction results on more datasets can be found in the ablation study.

4.2 Visual Generation with AutoRegressive Transformers

Using OmniTokenizer-VQVAE for tokenization, we train language models to predict latent code indices in the codebook in an autoregressive manner for image and video synthesis. The class-conditional 256 $\times$ 256 generation results on ImageNet, presented in Table 4, demonstrate that our model surpasses existing autoregressive image generation methods with significant margins. Remarkably, with a model comprising only 227M parameters, we achieve 10.13 FID and 94.5 IS, outperforming VQGAN by 32% and 25%, respectively. Upon scaling up to a larger model with 650M parameters, the FID is further reduced to 7.45.

In the domain of video generation, as illustrated in Table 4, our model beats the previous state-of-the-art autoregressive model, TATS, for class-conditional video generation on UCF-101 with a much lower FVD (283 vs. 314). Moreover, for frame prediction on the Kinetics-600 dataset, our model not only achieves the best performance among autoregressive models but also surpasses Phenaki, a non-autoregressive method.

4.3 Visual Generation with Diffusion Models

In parallel to language model-based methods, diffusion models, especially latent diffusion models, are another promising technique for visual synthesis. Therefore, we also evaluate the effectiveness of our method on diffusion model-based image and video generation with OmniTokenizer-VAE as the tokenizer. Here we employ the same architectures as DiT and Latte and replace their VAE with OmniTokenizer-VAE. DiT was the first to apply the transformer architecture to latent diffusion models and exhibits appealing scalability properties. Following this, Latte extends the transformer to latent video diffusion models by alternating spatial and temporal attention blocks.

The experimental results, as depicted in Table 6, indicate that when equipped with OmniTokenizer-VAE, DiT-XL/2 with classifier-free guidance (CFG) achieves a better inception score of 244.23, underscoring the efficacy of our tokenizer within diffusion model frameworks for image synthesis. For unconditional video generation on the UCF-101 dataset, our method not only offers the advantage of reduced training costs by realizing a higher compression rate, but also exhibits a much lower FVD than previous methods.

4.4 Ablation Study

Training Paradigms. To verify the effect of the proposed progressive training paradigm, we compare different training strategies and show the results in Table 7. The results in lines 3-4 and line 6 indicate that joint training remarkably outperforms video-only training on all video datasets, demonstrating the importance of image pre-training for the subsequent video training. In addition, although joint training at a fixed resolution (line 5) achieves much better results on video datasets than video-only training, the reconstruction FID on ImageNet gets worse, i.e., from 1.28 to 1.35. By comparison, the progressive training paradigm leads to the best performance on video datasets and, surprisingly, also improves the image reconstruction performance.

Architecture and Efficiency Analysis. In Table 3, we compare the inference cost (GFLOPs, i.e., giga floating-point operations, a hardware-independent metric) and reconstruction FID of different architectures on ImageNet. Compared to spatial-temporal joint attention (JointAttn) and decoupled plain attention (DePlainAttn), our decoupled architecture with spatial window attention and temporal causal attention leads to the lowest inference overhead and best rFID.
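A rough way to see why the decoupled design is cheaper is to count query-key pairs in the attention layers. The sketch below does only that: it ignores projections, MLPs, and the number of heads, so it is not a GFLOPs calculation and does not reproduce Table 3. The latent grid size in the example and the function names are hypothetical, and whether causal masking actually saves compute depends on the attention kernel.

```python
def joint_attention_pairs(t, h, w):
    """Every token attends to every other token across space and time."""
    n = t * h * w
    return n * n


def decoupled_plain_pairs(t, h, w):
    """Full spatial attention within each frame, full temporal attention per location."""
    return t * (h * w) ** 2 + (h * w) * t * t


def decoupled_window_causal_pairs(t, h, w, window):
    """Spatial attention restricted to window x window tiles, causal temporal attention."""
    spatial = t * (h * w) * window * window
    temporal = (h * w) * t * (t + 1) // 2  # only pairs with key frame <= query frame
    return spatial + temporal


# Hypothetical latent grid: 5 latent frames of 32 x 32 tokens, window size 8.
joint = joint_attention_pairs(5, 32, 32)
plain = decoupled_plain_pairs(5, 32, 32)
windowed = decoupled_window_causal_pairs(5, 32, 32, window=8)
```

Under these assumptions the pair count drops from the joint design to the decoupled plain design and again to the windowed causal design, which is consistent in direction with the ordering of inference costs reported above.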

Latent Dimension and Compression Rate. Figure 3 shows the reconstruction FID with different compression rates and latent dimensions. We can observe that increasing the compression rate always hurts the reconstruction performance since more information is lost during the encoding process. Moreover, latent dimension = 8 leads to the best trade-off between rFID and codebook usage.
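To make the notion of compression rate concrete, the sketch below computes the size of the latent grid for a given input. The compression factors used here (8x spatial, 4x temporal) and the convention that the first frame is encoded on its own are assumptions chosen for illustration; they are not stated in this text, although a 17-frame clip is consistent with a "1 + 16" split.

```python
def latent_grid(frames, height, width, spatial_factor=8, temporal_factor=4):
    """Return (latent frames, latent height, latent width).

    Assumes the first frame is encoded alone and every further group of
    temporal_factor frames shares one latent frame (hypothetical convention).
    """
    if (frames - 1) % temporal_factor:
        raise ValueError("frames - 1 must be divisible by the temporal factor")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("resolution must be divisible by the spatial factor")
    return (1 + (frames - 1) // temporal_factor,
            height // spatial_factor,
            width // spatial_factor)


image_grid = latent_grid(frames=1, height=256, width=256)   # (1, 32, 32)
video_grid = latent_grid(frames=17, height=256, width=256)  # (5, 32, 32)
```

A higher compression factor shrinks this grid, so fewer latent codes must carry the same visual content, which is why reconstruction quality degrades as the compression rate increases.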

4.5 Visualizations

Visual Reconstruction. We visualize the reconstruction results by OmniTokenizer, VQGAN and TATS in Figure 4. Our method works significantly better than baselines for face and text reconstruction, which are typically regarded as the most challenging reconstruction cases.

Class-conditional Image and Video Generation. The class-conditional generation results are shown in Figures 5-8. Our model synthesizes visually coherent and contextually accurate images and videos, showcasing the strengths of OmniTokenizer in facilitating generative tasks.

Frame Prediction and Arbitrary Long Video Generation. The frame prediction results of our method are presented in Figure 9, from which we can see that our model forecasts subsequent frames with high clarity and temporal coherence. Moreover, we demonstrate the potential of our method for generating videos of arbitrary length by employing a cyclical process, where each newly generated frame is recursively used as a condition for generating the subsequent frame.
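The cyclical process can be sketched as a sliding-window rollout. In the sketch below, `predict_next` stands in for the trained generative model, and `context_size` (how many recent frames condition each step) is a hypothetical parameter; frames are represented by integers purely so that the example is runnable.

```python
def generate_long_video(initial_frames, predict_next, total_frames, context_size):
    """Extend a clip frame by frame, conditioning each step on the latest frames."""
    frames = list(initial_frames)
    while len(frames) < total_frames:
        context = frames[-context_size:]      # most recent frames only
        frames.append(predict_next(context))  # new frame joins the next context
    return frames


def toy_predictor(context):
    """Stand-in for the model: the next 'frame' is the last one plus one."""
    return context[-1] + 1


video = generate_long_video([0, 1, 2], toy_predictor,
                            total_frames=8, context_size=3)
```

Because each step sees only a fixed-size context, the cost per generated frame stays constant regardless of how long the video becomes, at the price of the model never seeing frames older than the context window.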

Conclusion and Discussion of Broader Impact

This paper presented OmniTokenizer, a transformer-based tokenizer for joint image-video tokenization. OmniTokenizer adopts a spatial-temporal decoupled architecture, employing window attention in the spatial dimension and causal attention in the temporal dimension. To realize the synergy between image and video data, we proposed a progressive training strategy that starts with image training at a fixed resolution to acquire the spatial encoding capability and then incorporates video data for multi-resolution joint training to learn temporal modeling. Extensive experimental results substantiate the state-of-the-art performance of OmniTokenizer in visual reconstruction tasks. Further, when equipped with OmniTokenizer, both language model-based methods and diffusion models achieve superior visual generation results.

Previous literature has revealed that the performance of transformer models improves significantly as the model size increases, a phenomenon known as the scaling law. In the future, we will explore scaling the model capacity of OmniTokenizer for more advanced tokenization performance.