OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan

Introduction

Video generation has gained significant attention in both academia and industry, especially after the announcement of OpenAI’s SORA. Currently, Latent Video Diffusion Models (LVDMs), such as MagicTime, VideoComposer, AnimateDiff, Stable Video Diffusion (SVD), HiGen, Latte, SORA, Open-Sora, and Open-Sora-Plan, dominate video generation owing to their stability, effectiveness, and scalability. These LVDMs share the same workflow: Variational Autoencoders (VAEs) compress the original videos into latent representations, and denoisers are then trained to predict the noise added to these compressed representations.

However, the VAE most frequently used by LVDMs, Stable Diffusion VAE (SD-VAE), was originally designed to spatially compress images rather than videos. When compressing a video, it treats each frame as an individual image, completely ignoring the redundancy in the temporal dimension. This results in a temporally redundant latent representation, which increases the input size of the subsequent denoisers and leads to high hardware consumption for LVDMs. In addition, frame-wise compression ignores temporal information that is beneficial to reconstruction, lowering reconstruction accuracy and reducing the quality of the results generated by LVDMs. Although the decoder of Stable Video Diffusion VAE (SVD-VAE) exploits temporal information, it still does not compress videos in the temporal dimension, so the hardware burden on LVDMs remains heavy.

Furthermore, temporal compression of videos has been explored in some works on autoregressive video generation. They utilize VQ-VAEs to compress videos spatio-temporally into discrete tokens, and the subsequent transformers learn to predict these tokens. Although these VQ-VAEs cannot provide continuous latent representations for LVDMs, they still indicate the feasibility of temporal compression in the VAEs of LVDMs.

To relieve the hardware burden of LVDMs and enhance their video generation ability with limited resources, we propose an omni-dimensional compression VAE (OD-VAE), which temporally and spatially compresses videos into concise latent representations. Since video frames are highly correlated in time, our OD-VAE, built on a strong 3D-causal-CNN architecture, can reconstruct videos accurately even with additional temporal compression. The sufficient compression and effective reconstruction of OD-VAE greatly improve the efficiency of LVDMs. Video reconstruction quality and compression speed are significant to the video generation results of LVDMs and to their training speed, respectively; to achieve a better trade-off between the two, we introduce and analyze four model variants of OD-VAE. To train our OD-VAE more efficiently, we propose a novel tail initialization that exploits the weights of SD-VAE. Besides, we propose temporal tiling, a split inference strategy with one-frame overlap, enabling OD-VAE to handle videos of arbitrary length with limited GPU memory.

Our contributions are summarized as follows:

We propose OD-VAE, an omni-dimensional video compressor with high reconstruction accuracy, which improves the efficiency of LVDMs.

To achieve a better trade-off between video reconstruction quality and compression speed, we introduce and analyze four model variants of OD-VAE.

To further improve the training efficiency and inference ability of our OD-VAE, we propose novel tail initialization and temporal tiling, respectively.

Extensive experiments and ablations on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of our methods.

Related Work

Video generation is a significant task in artificial intelligence. Latent Video Diffusion Models (LVDMs), which first use VAEs to compress videos into latent representations and then utilize denoisers to predict the noise added to them, have been developing rapidly since last year. OpenAI’s SORA, which can generate one-minute videos at 1080P resolution, greatly shocked the world. LVDMs can be divided into two kinds according to the structure of their denoisers. The first kind uses U-Net-based denoisers, such as MagicTime, AnimateDiff, and Stable Video Diffusion (SVD), while the second kind utilizes Transformer-based denoisers, such as Latte, SORA, Open-Sora, and Vidu. Whatever the structure of the denoiser, the VAE determines the size of the denoiser's input and the accuracy of reconstructing videos from latent representations. Thus, VAEs that provide concise representations while maintaining high reconstruction quality will greatly improve the efficiency of LVDMs.

Variational Autoencoder

The Variational Autoencoder (VAE) was initially designed for generation tasks by maximizing the Evidence Lower Bound (ELBO) of data. Gradually, it has become a preceding component of other generation models and can be divided into two types. The first is VQ-VAEs, which compress videos into discrete tokens and are used by autoregressive video generation models. Temporal compression of videos already exists in these VQ-VAEs, and the 3D-causal-CNN-based MAGVIT-v2 achieves state-of-the-art video reconstruction. However, the discrete representations provided by VQ-VAEs are unsuitable for LVDMs. The second is continuous VAEs, which compress videos into continuous representations and are used by LVDMs. Among them, Stable Diffusion VAE (SD-VAE) and its decoder-enhanced version, Stable Video Diffusion VAE (SVD-VAE), are the most popular. However, they only compress videos spatially and ignore their temporal redundancy. Besides, we have discovered two works that are concurrent with ours. One is OPS-VAE, which utilizes two cascaded VAEs to compress videos spatially and temporally, respectively. The other is CV-VAE, which proposes a temporally compressed VAE but focuses more on aligning its latent space with that of SD-VAE. We comprehensively compare our OD-VAE with them in the experiments.

Method

In this section, we first provide the overview of OD-VAE, shown in Fig. 1. Then, we discuss the four model variants of OD-VAE, shown in Fig. 2. Finally, we introduce the tail initialization and temporal tiling.

Our OD-VAE adopts a 3D-causal-CNN architecture to temporally and spatially compress videos into concise latent representations and can reconstruct them accurately, as shown in Fig. 1. Since the structure of SD-VAE is mature and stable, the basic design of our 3D-causal-CNN architecture is derived from it, which will be introduced in the next subsection. Let $\mathcal{E}$ and $\mathcal{D}$ denote the encoder and the decoder of our OD-VAE, respectively. A video containing $N+1$ frames is denoted as $\boldsymbol{X}=[\boldsymbol{x_{1}},\boldsymbol{x_{2}},...,\boldsymbol{x_{N+1}}]\in\mathcal{R}^{(N+1)\times H\times W\times 3}$, and the $i$-th frame of $\boldsymbol{X}$ is expressed as $\boldsymbol{x_{i}}\in\mathcal{R}^{H\times W\times 3}$. The compressed latent representation of $\boldsymbol{X}$ is denoted as $\boldsymbol{Z}\in\mathcal{R}^{(n+1)\times h\times w\times c}$. When processing the video $\boldsymbol{X}$, OD-VAE keeps the temporal independence of its first frame $\boldsymbol{x_{1}}$ and only spatially compresses it. In contrast, the following frames $\boldsymbol{x_{i}}(i>1)$ are compressed in both the temporal and spatial dimensions. This can be formulated as:

$$\boldsymbol{Z}=\mathcal{E}(\boldsymbol{X})$$

The reconstruction is the inverse of the compression. We use $\boldsymbol{\hat{X}}\in\mathcal{R}^{(N+1)\times H\times W\times 3}$ to express the reconstructed video and the process can be formulated as:

$$\boldsymbol{\hat{X}}=\mathcal{D}(\boldsymbol{Z})=\mathcal{D}(\mathcal{E}(\boldsymbol{X}))$$

The temporal and spatial compression rates of OD-VAE are $c_{t}=\frac{N}{n}$ and $c_{s}=\frac{H}{h}=\frac{W}{w}$, respectively. We set $c_{s}=8$ following SD-VAE and find that $c_{t}=4$ offers a good trade-off between sufficient compression and accurate reconstruction.
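As a small worked example of these rates, the sketch below computes the latent size $(n+1)\times h\times w$ for a given clip, reading $c_{t}$ and $c_{s}$ as downsampling factors ($n=N/c_{t}$, $h=H/c_{s}$, $w=W/c_{s}$). The function name `latent_shape` is hypothetical, and the latent channel count $c$ is left out because it is not specified here.

```python
def latent_shape(num_frames, height, width, c_t=4, c_s=8):
    """Return (n + 1, h, w) for a clip with N + 1 = num_frames frames.

    The first frame is only compressed spatially, so it always yields
    exactly one latent frame; the remaining N frames yield N / c_t.
    """
    n_rest = num_frames - 1
    if n_rest % c_t or height % c_s or width % c_s:
        raise ValueError("sizes must be divisible by the compression rates")
    return (n_rest // c_t + 1, height // c_s, width // c_s)


print(latent_shape(25, 256, 256))  # (7, 32, 32)
print(latent_shape(81, 256, 256))  # (21, 32, 32)
print(latent_shape(1, 256, 256))   # (1, 32, 32), the single-image case N = 0
```

A purely spatial VAE with the same $c_{s}$ would keep all 25 (or 81) latent frames, which is where the reduction in denoiser input size comes from.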

Model Variants of OD-VAE

Since video compression is necessary for the training of LVDMs, increasing the compression speed of our OD-VAE can greatly improve their training efficiency. Hence, we introduce and analyze four different model variants of our OD-VAE, aiming to achieve a better trade-off between the compression speed and video reconstruction quality.

Variant 1. An easy way to extend SD-VAE to our 3D-causal-CNN-based OD-VAE is to inflate all the 2D convolutions into 3D convolutions by adding a temporal dimension to all 2D kernels, as shown in Fig. 2(a). The video reconstruction ability of variant 1 is the best, since its full-3D architecture can completely exploit the temporal and spatial information in the video by letting features interact spatio-temporally at each convolution. However, the numerous expensive 3D convolutions in the network lead to a slow compression speed, lowering the training efficiency of LVDMs.

Variant 2. Since numerous 3D convolutions in variant 1 lead to a slow compression speed, we utilize an intuitive way to reduce expensive 3D convolutions. Specifically, we replace half of the 3D convolutions in variant 1 with 2D convolutions and obtain variant 2, shown in Fig. 2(b). In variant 2, half of its convolutions are limited to only conducting spatial transformation for the input features, lowering the computational consumption of compression. As half of the convolutions can still process the features omni-dimensionally, abundant temporal and spatial information in a video is still well utilized, guaranteeing its reconstruction ability.

Variant 3. However, in variant 1, the consumption of each 3D convolution is different. The 3D convolutions in the outer blocks process large-sized features with huge expense while those in the inner blocks process small-sized features with little expense. Hence, replacing a 3D convolution in an outer block leads to a greater reduction in consumption than replacing one in an inner block. Based on this, we utilize a more reasonable replacement strategy for variant 1 and obtain variant 3. Specifically, we replace all the 3D convolutions in some outer blocks with 2D convolutions while maintaining the other inner blocks unchanged, shown in Fig. 2(c). With this strategy, the compression speed of variant 3 will probably be faster than that of variant 2.

Variant 4. Since the decoder of OD-VAE does not participate in video compression, replacing convolutions in the decoder cannot improve the training efficiency of LVDMs and only lowers the reconstruction accuracy. Therefore, we keep the decoder of variant 1 unchanged and only replace the 3D convolutions in the outer blocks of the encoder with 2D convolutions, obtaining variant 4, shown in Fig. 2(d). With a full 3D decoder, the video reconstruction ability of variant 4 will probably be better than that of variant 3.
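The cost argument behind variants 3 and 4 can be illustrated with a rough multiply-accumulate (MAC) count. The sketch below is only indicative: the layer sizes, the temporal kernel size of 3, and the helper name `conv_macs` are assumptions chosen for illustration, not values taken from the architecture described here. A 2D convolution applied frame by frame is modelled as a 3D convolution with temporal kernel size 1.

```python
def conv_macs(frames, height, width, c_in, c_out, k_t, k_s=3):
    """MACs of one stride-1, same-padded convolution over a feature volume."""
    return frames * height * width * c_in * c_out * k_t * k_s * k_s


# Hypothetical feature sizes: an outer block works at full resolution,
# an inner block works after 4x temporal and 8x spatial downsampling.
outer = dict(frames=25, height=256, width=256, c_in=128, c_out=128)
inner = dict(frames=7, height=32, width=32, c_in=512, c_out=512)

for name, layer in (("outer", outer), ("inner", inner)):
    macs_3d = conv_macs(k_t=3, **layer)  # full 3D kernel
    macs_2d = conv_macs(k_t=1, **layer)  # per-frame 2D kernel
    print(f"{name}: saved {macs_3d - macs_2d:,} MACs by switching to 2D")
```

Under these assumed sizes, swapping one outer-block convolution saves far more computation than swapping one inner-block convolution, which matches the reasoning for replacing outer blocks first.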

Tail Initialization and Temporal Tiling

Tail Initialization. Notably, when $N=0$, the video $\boldsymbol{X}$ degenerates into an image and our OD-VAE can be viewed as an image VAE. This gives OD-VAE the potential to inherit the spatial compression and reconstruction ability of the powerful SD-VAE. With this inheritance of ability in the spatial dimension, the training efficiency of our OD-VAE is higher, since the spatial prior accelerates the convergence of our model. Hence, for better inheritance, we design a special initialization method that fully utilizes the weights of the 2D SD-VAE, named tail initialization. Specifically, we denote a 5-dimensional 3D convolution kernel in OD-VAE as $\boldsymbol{K_{3D}}\in\mathcal{R}^{I\times O\times T\times H\times W}$, and its corresponding 4-dimensional 2D kernel in SD-VAE as $\boldsymbol{K_{2D}}\in\mathcal{R}^{I\times O\times H\times W}$. For $\boldsymbol{K_{3D}}$, we use the weights of $\boldsymbol{K_{2D}}$ to initialize its temporally last element and set the other elements to $0$, expressed as:

$$\boldsymbol{K_{3D}}[:,:,t,:,:]=\begin{cases}\boldsymbol{K_{2D}}, & t=T\\ \boldsymbol{0}, & 1\leq t<T\end{cases}$$

We use $\boldsymbol{F_{3D}}$ and $\boldsymbol{F_{2D}}$ to denote the input feature maps of $\boldsymbol{K_{3D}}$ and $\boldsymbol{K_{2D}}$, respectively. With tail initialization, before training, our OD-VAE satisfies the following equation:

$$\boldsymbol{K_{3D}}\ast\boldsymbol{F_{3D}}=\boldsymbol{K_{2D}}\ast\boldsymbol{F_{2D}}$$

The equation means that our OD-VAE can compress an image into a latent representation and reconstruct it accurately as SD-VAE without learning. This indicates that the spatial compression and reconstruction ability of SD-VAE is completely transferred to our OD-VAE. The strong spatial prior accelerates the convergence of our OD-VAE, greatly enhancing the training efficiency.
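The effect of tail initialization can be checked on a toy example. The following is a minimal sketch under stated assumptions: a single channel, a 1D "spatial" axis standing in for the $H\times W$ plane, and a causal temporal padding that repeats the first frame (the padding choice is an assumption; with tail initialization the result does not depend on it, because the padded frames only meet zero weights). The helper names are hypothetical.

```python
def spatial_conv(frame, kernel):
    """Same-size, zero-padded 1D correlation standing in for the 2D convolution."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(frame) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(frame))]


def causal_conv(frames, kernel_3d):
    """Causal temporal convolution: output frame t only sees frames <= t.

    kernel_3d is a list of T spatial kernels, ordered from oldest to newest tap.
    """
    T = len(kernel_3d)
    padded = [frames[0]] * (T - 1) + list(frames)  # assumed front padding
    out = []
    for t in range(len(frames)):
        acc = [0.0] * len(frames[0])
        for tau in range(T):
            tap = spatial_conv(padded[t + tau], kernel_3d[tau])
            acc = [a + b for a, b in zip(acc, tap)]
        out.append(acc)
    return out


def tail_init(kernel_2d, T):
    """Zeros everywhere except the temporally last slice, which copies kernel_2d."""
    zero = [0.0] * len(kernel_2d)
    return [list(zero) for _ in range(T - 1)] + [list(kernel_2d)]


kernel_2d = [0.25, 0.5, 0.25]
image = [[1.0, 2.0, 4.0, 8.0]]  # N = 0: the video is a single image
kernel_3d = tail_init(kernel_2d, T=3)

# Before any training, the inflated kernel reproduces the 2D result exactly.
assert causal_conv(image, kernel_3d) == [spatial_conv(image[0], kernel_2d)]
```

The same check passes for longer clips, where the tail-initialized layer simply applies the 2D kernel to every frame independently.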

Temporal Tiling. Since long video generation has become a main trend, enabling our OD-VAE to handle videos of arbitrary length with limited GPU memory is necessary. Hence, we design a split inference strategy with one-frame overlap, named temporal tiling. Specifically, we temporally split a video $\boldsymbol{X}$ into $M$ groups, denoted as $[\boldsymbol{X_{1}},\boldsymbol{X_{2}},...,\boldsymbol{X_{M}}]$. The last frame of $\boldsymbol{X_{i}}$ and the first frame of $\boldsymbol{X_{i+1}}$ are the same. We compress each group $\boldsymbol{X_{i}}$ into a latent representation $\boldsymbol{Z_{i}}$ individually. Then, we drop the first frame of $\boldsymbol{Z_{i}}$ when $i>1$ and concatenate $\boldsymbol{Z_{i}}(1\leq i\leq M)$ along the temporal dimension to obtain $\boldsymbol{Z}$. We apply the same grouping mechanism to the reconstructed video $\boldsymbol{\hat{X}}$, so that $\boldsymbol{\hat{X}}=[\boldsymbol{\hat{X}_{1}},\boldsymbol{\hat{X}_{2}},...,\boldsymbol{\hat{X}_{M}}]$. To reconstruct $\boldsymbol{\hat{X}}$ from $\boldsymbol{Z}$, we first decode $\boldsymbol{Z_{i}}$ into $\boldsymbol{\hat{X}_{i}}$ individually. Then, we drop the first frame of $\boldsymbol{\hat{X}_{i}}$ when $i>1$ and concatenate $\boldsymbol{\hat{X}_{i}}(1\leq i\leq M)$ along the temporal dimension. As video frames are highly correlated in time, the overlap connects the groups well and greatly reduces compression and reconstruction errors.
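The bookkeeping of temporal tiling can be sketched as follows. This is an illustrative sketch, not the actual implementation: `toy_encode` and `toy_decode` are hypothetical stand-ins for $\mathcal{E}$ and $\mathcal{D}$ that keep the first frame and average or repeat the rest in blocks of $c_{t}=4$, and frames are plain numbers. With these stand-ins the tiled result matches direct processing exactly; a real network would only match approximately.

```python
def split_with_overlap(items, group_len):
    """Split into groups so that consecutive groups share exactly one item."""
    if group_len < 2:
        raise ValueError("group_len must be at least 2")
    step = group_len - 1
    groups, start = [], 0
    while not groups or start < len(items) - 1:
        groups.append(items[start:start + group_len])
        start += step
    return groups


def toy_encode(frames, c_t=4):
    """Stand-in encoder: keep frame 0, then average every c_t frames."""
    rest = frames[1:]
    if len(rest) % c_t:
        raise ValueError("need 1 + k * c_t frames")
    return [frames[0]] + [sum(rest[i:i + c_t]) / c_t
                          for i in range(0, len(rest), c_t)]


def toy_decode(latents, c_t=4):
    """Stand-in decoder: keep latent 0, then repeat each latent c_t times."""
    out = [latents[0]]
    for z in latents[1:]:
        out.extend([z] * c_t)
    return out


def tiled_encode(frames, group_len, c_t=4):
    latents = []
    for i, group in enumerate(split_with_overlap(frames, group_len)):
        z = toy_encode(group, c_t)
        latents.extend(z if i == 0 else z[1:])  # drop the overlapping first latent
    return latents


def tiled_decode(latents, group_len, c_t=4):
    latent_group_len = (group_len - 1) // c_t + 1
    frames = []
    for i, group in enumerate(split_with_overlap(latents, latent_group_len)):
        x = toy_decode(group, c_t)
        frames.extend(x if i == 0 else x[1:])  # drop the overlapping first frame
    return frames


video = [float(i) for i in range(97)]      # 97 frames, groups of 33
print(len(split_with_overlap(video, 33)))  # 3
latents = tiled_encode(video, 33)
print(len(latents))                        # 25
assert latents == toy_encode(video)
assert tiled_decode(latents, 33) == toy_decode(latents)
```

Choosing a group length of the form $1+k\,c_{t}$ keeps every group, including the last one, compatible with the temporal compression rate.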

Experiment

In this section, we first introduce the experimental setting, including models, training strategy, and evaluation details. Then, comprehensive comparisons between OD-VAE and other baselines on video reconstruction and LVDM-based video generation are conducted to demonstrate the superiority of our OD-VAE. Finally, extensive ablations are provided to certify the effectiveness of our proposed methods.

Models. To demonstrate the effectiveness and efficiency of our OD-VAE, we compare it with six other state-of-the-art, commonly used VAEs in terms of video reconstruction and LVDM-based video generation: (1) VQGAN: a widely used image VQ-VAE. Following prior work, we use its f8-8192 version in our experiment. (2) TATS: a 3D video VQ-VAE applied to autoregressive video generation. (3) SD-VAE: the image VAE most frequently used by LVDMs. Following prior work, we use its numerically stable version, SD2.1-VAE. (4) SVD-VAE: a video VAE obtained by enhancing the decoder of SD-VAE. It shares the same encoder structure as SD-VAE. (5) CV-VAE: a video VAE contemporaneous with our research. (6) OPS-VAE: another video VAE contemporaneous with our research. It first downsamples an input video spatially and then temporally. As the discrete VQGAN and TATS are not suitable for LVDMs, they are only used for experiments on video reconstruction. In the method section, we introduce four model variants of our OD-VAE. We use variant 4 to compare with the other baselines since, according to the ablations, it achieves the best trade-off between video reconstruction quality and compression speed among all the variants.

Training strategy. We use the Adam optimizer to train our OD-VAE for 650k steps, with a constant learning rate of $1\times 10^{-5}$ and a batch size of 8. The training dataset of our OD-VAE contains 440k self-scraped internet videos and 220k videos from the K400 dataset. During training, all the input videos are processed into clips of 25 frames at $256\times 256$ resolution. Following prior work, the loss function contains a reconstruction term, a KL term, and an adversarial term. To obtain more stable training results, following prior work, we utilize an exponential moving average (EMA) of OD-VAE weights over training with a decay of 0.999. Since SD2.1-VAE is numerically stable, we use its weights to initialize our OD-VAE, enhancing the training efficiency. The training is conducted on 8 NVIDIA 80G A100 GPUs with PyTorch.
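For readers unfamiliar with weight averaging, the EMA with a decay of 0.999 can be sketched as below. This assumes the common update rule $\theta_{\text{EMA}}\leftarrow 0.999\,\theta_{\text{EMA}}+0.001\,\theta$ applied after each optimizer step; the exact rule and schedule used in training are not spelled out here, and the weights are plain Python floats for illustration.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [0.0, 1.0]      # hypothetical running average of two weights
current = [1.0, 1.0]  # hypothetical weights after an optimizer step
for _ in range(1000):
    ema = ema_update(ema, current)
print(ema)  # the first entry has moved roughly 63% of the way toward 1.0
```

With this decay, the average effectively looks back over about a thousand steps, which smooths out step-to-step noise in the weights.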

Evaluation details. For evaluation on video reconstruction, we select two popular large open-domain video datasets, WebVid-10M and Panda-70M. We only use their validation sets for efficiency and fairness. For each video in these two validation sets, we transform it into a clip of 25 frames at $256\times 256$ resolution. To quantify the video reconstruction ability of the models, we use three popular metrics: peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). We also use the video compression rate (VCPR) and the number of parameters (Params) to denote the video compression level and the network complexity of these VAEs, respectively. To evaluate the effect of these VAEs on LVDM-based video generation, we fix the structure of the denoiser and change the VAE that precedes it. We select Latte’s denoiser, since it uses a novel SORA-like transformer-based structure and achieves excellent results in LVDM-based video generation. Following prior work, we choose two public datasets, UCF101 and SkyTimelapse, for class-conditional and unconditional generation, respectively. We use almost the same setting as in prior work to train Latte’s denoiser with these VAEs for 200k steps. The only difference is that we use longer video clips of 81 frames and adjust the batch size to fit the memory limit of a single GPU. To assess the quality of the generated videos on the two datasets, we employ two popular metrics, Fréchet Video Distance (FVD) and Kernel Video Distance (KVD). In addition, we report the Inception Score (IS) of the models on the UCF101 dataset, calculated by a trained C3D model. These metrics are calculated based on 2048 samples. To measure the efficiency of the LVDM with different VAEs, we list the training GPU memory consumption (TMem) and training speed (TSpeed) on the two datasets. These tests are conducted on NVIDIA 80G A100 GPUs.
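Of the reconstruction metrics, PSNR has a simple closed form, $\mathrm{PSNR}=10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$, reported in dB. The sketch below computes it over flattened pixel lists, assuming an 8-bit range ($\mathrm{MAX}=255$); the pixel range and any per-frame averaging used in the evaluation are not stated here, so treat these as assumptions.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [10.0, 20.0, 30.0, 40.0]  # hypothetical pixel values
restored = [11.0, 19.0, 31.0, 39.0]  # every pixel off by one, so MSE = 1
print(round(psnr(original, restored), 2))  # 48.13
```

Higher PSNR and SSIM indicate better reconstruction, whereas lower LPIPS is better.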

Comparison with Other Baselines

We display the video reconstruction results of our OD-VAE and other baselines in Table 1. The results in Table 1 show that although OD-VAE temporally compresses videos by $4\times$, its reconstruction quality is not inferior to that of the commonly used SD-VAE and SVD-VAE. For example, the PSNR and SSIM of our OD-VAE on the WebVid-10M validation set are 0.97 dB and 0.0315 higher than those of SD-VAE, respectively. Compared to SVD-VAE, although the overall performance of our OD-VAE is worse, its PSNR and SSIM on the WebVid-10M validation set are still slightly higher. This indicates that our OD-VAE can exploit the temporal redundancy of video frames to obtain a more concise latent representation while maintaining high reconstruction quality. Furthermore, our OD-VAE performs better than the two concurrent works, CV-VAE and OPS-VAE, which supports the effectiveness of our model design and training strategy. For example, the SSIM of OD-VAE on the WebVid-10M validation set is 0.0128 and 0.0125 higher than those of CV-VAE and OPS-VAE, respectively. On the Panda-70M validation set, the LPIPS of our OD-VAE is 0.0219 and 0.0211 lower than those of CV-VAE and OPS-VAE, respectively.

In Table 2, we display the LVDM-based video generation results of our OD-VAE and other baselines. The results in Table 2 show that the $4\times$ temporal compression of the VAE greatly improves the efficiency of the LVDM. On the two datasets, the video generation results of our OD-VAE are better than those of SD-VAE and SVD-VAE, while the training consumption is greatly reduced. For example, on the UCF101 dataset, with the same number of training steps, using our OD-VAE achieves better FVD (370.16 lower than that of SD-VAE and 348.85 lower than that of SVD-VAE) and faster training speed ($2.06\times$ that of SD-VAE and SVD-VAE). Furthermore, compared to CV-VAE and OPS-VAE, although the video compression rate is the same, our OD-VAE brings better video generation results and lower training consumption to the LVDM. For example, on the SkyTimelapse dataset, with the same number of training steps, using our OD-VAE obtains better FVD (32.55 lower than that of CV-VAE and 17.91 lower than that of OPS-VAE) and faster training speed (0.47 it/s faster than that of SD-VAE and 0.27 it/s faster than that of OPS-VAE). Besides, we show some visual results of the LVDM with different VAEs on the SkyTimelapse dataset in Fig. 3. According to Fig. 3, with OD-VAE, the LVDM can generate more realistic and higher-quality videos.

Ablation Experiment

Model variant. To obtain the variant with the best trade-off between video reconstruction quality and compression speed, we train the four model variants of OD-VAE for 150k steps with the same setting mentioned above. We show their PSNR and LPIPS on the WebVid-10M validation set in Fig. 4(a) and (b). Besides, we use their final checkpoints to train Latte’s denoiser on the UCF101 dataset with the same setting mentioned above and report their FVD in Fig. 4(c). The compression speed (CSpeed) of the four variants, measured by processing videos of 81 frames at $256\times 256$ resolution, along with the training speed (TSpeed) of the LVDM with them on the UCF101 dataset, are listed in Table 3. According to Fig. 4(a) and (b), the PSNR and LPIPS of variant 4 are slightly worse than those of variant 1 but better than those of the other variants. Since the reconstruction abilities of the four variants are close, using them as the preceding components of the LVDM yields similar video generation results, as shown in Fig. 4(c). However, according to Table 3, the compression speed of variant 4 is much faster than that of variant 1 and variant 2, bringing a large efficiency gain to the training of the LVDM. Hence, our OD-VAE utilizes variant 4 as the final structure, achieving the best trade-off between video reconstruction quality and compression speed.

Initialization Method. To verify the effectiveness of our tail initialization, we compare it with two other initialization methods: average initialization and random initialization. Average initialization can be expressed as:

$$\boldsymbol{K_{3D}}[:,:,t,:,:]=\frac{1}{T}\boldsymbol{K_{2D}},\quad 1\leq t\leq T$$

Random initialization means that we initialize our OD-VAE with Gaussian random numbers. We initialize our OD-VAE with each of the three methods and train the three versions for 150k steps with the same setting mentioned above. We show their PSNR and LPIPS on the WebVid-10M validation set in Fig. 4(d) and (e). Besides, we use their final checkpoints to train Latte’s denoiser on the UCF101 dataset with the same setting mentioned above and report their FVD in Fig. 4(f). According to Fig. 4(d), (e), and (f), with the same number of training steps, using tail initialization can greatly improve the video reconstruction ability of our OD-VAE and the video generation quality of the LVDM.

Temporal Tiling. When directly compressing and reconstructing a video of $256\times 256$ resolution on an NVIDIA 80G A100 GPU, the maximum number of frames our OD-VAE can process is 125. With temporal tiling, our OD-VAE can handle a video in groups and the original length limitation disappears. This enables the LVDM to generate longer videos. To evaluate the effect of temporal tiling on video reconstruction and LVDM-based video generation, we conduct experiments on the WebVid-10M validation set and the UCF101 dataset, respectively, with the same setting mentioned above. We fix the length of a group to 33 frames and increase the length of the WebVid-10M validation clips from 33 to 97 frames. In Table 4, we list the PSNR and LPIPS on the WebVid-10M validation set, and the FVD and IS on the UCF101 dataset. According to Table 4, these metrics degrade only slightly with temporal tiling, which means temporal tiling does little harm to the video reconstruction ability of our OD-VAE and the video generation quality of the corresponding LVDM.

Conclusion

In this work, we proposed a novel omni-dimensional compression VAE for improving LVDMs, termed OD-VAE. It utilized an effective 3D-causal-CNN architecture to compress videos $4\times$ temporally and $8\times$ spatially into latent representations while maintaining high reconstruction accuracy. These more concise representations reduced the input size of the denoisers in LVDMs, greatly improving their efficiency. To achieve a better trade-off between video reconstruction quality and compression speed, we introduced and analyzed four variants of our OD-VAE. To train OD-VAE more efficiently, we proposed a novel tail initialization that fully exploits the weights of SD-VAE. Besides, we proposed temporal tiling, a split inference strategy with one-frame overlap, enabling our OD-VAE to process videos of arbitrary length with limited GPU memory. Comprehensive experiments and ablations on video reconstruction and LVDM-based video generation demonstrated the effectiveness and efficiency of our proposed methods.