Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin

cs.CV cs.LG

Introduction

Video is a media form that records the evolution of the physical world. Teaching AI systems to generate diverse video content plays a vital role in simulating real-world dynamics (Hu et al., 2023; Brooks et al., 2024) and interacting with humans (Bruce et al., 2024; Valevski et al., 2024). Cutting-edge diffusion models (Ho et al., 2022c; Blattmann et al., 2023a; OpenAI, 2024) and autoregressive models (Yan et al., 2021; Hong et al., 2023; Kondratyuk et al., 2024) have made remarkable breakthroughs in generating realistic, long-duration video by scaling data and computation. However, the need to model a very large spatiotemporal space makes training such video generative models computationally and data intensive.

To ease the computational burden of generating high-dimensional video data, a crucial component is to compress the original video pixels into a lower-dimensional latent space using a VAE (Kingma & Welling, 2014; Esser et al., 2021; Rombach et al., 2022). However, the regular compression rate (typically 8 $\times$ ) still results in excessive tokens, especially for high-resolution samples. In light of this, prevalent approaches utilize a cascaded architecture (Ho et al., 2022b; Pernias et al., 2024; Teng et al., 2024) to break down the high-resolution generation process into multiple stages, where samples are first created in a highly compressed latent space and then successively upsampled using additional super-resolution models. Although the cascaded pipeline avoids directly learning at high resolution and reduces the computational demands, the requirement for employing distinct models at different resolutions separately sacrifices flexibility and scalability. Besides, the separate optimization of multiple sub-models also hinders the sharing of their acquired knowledge.

This work presents an efficient video generative modeling framework that transcends the limitations of the previous cascaded approaches. Our motivation stems from the observation in Fig. 1(a) that the initial timesteps in diffusion models are quite noisy and uninformative. This suggests that operating at full resolution throughout the entire generation trajectory may not be necessary. To this end, we reinterpret the original generation trajectory as a series of pyramid stages that operate on compressed representations of different scales. Notably, the efficacy of image pyramids (Adelson et al., 1984) has been widely validated for discriminative neural networks (Lin et al., 2017; Wang et al., 2020) and more recently for diffusion models (Ho et al., 2022b; Pernias et al., 2024; Teng et al., 2024) and multimodal LLMs (Yu et al., 2023; Tian et al., 2024). Here, we investigate two types of pyramids: the spatial pyramid within a frame and the temporal one between consecutive frames (as illustrated in Fig. 1(b)). In such a pyramidal generation trajectory, only the final stage operates at full resolution, drastically reducing redundant computations in earlier timesteps. The main advantages are twofold: (1) The generation trajectories of different pyramid stages are interlinked, with each subsequent stage continuing to generate from the previous ones. This eliminates the need, present in some cascaded models, for each stage to regenerate from pure noise. (2) Instead of relying on separate models for each image pyramid, we integrate them into a single unified model for end-to-end optimization, which admits drastically expedited training with a more elegant implementation, as validated by experiments.

Based on the aforementioned pyramidal representations, we introduce a novel pyramidal flow matching algorithm that builds upon the recent flow matching framework (Lipman et al., 2023; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023). Specifically, we devise a piecewise flow for each pyramid resolution, which together form a generative process from noise to data. The flow within each pyramid stage takes a similar formulation, interpolating between a pixelated (compressed) and noisier latent and a pixelation-free (decompressed) and cleaner latent. With this design, the stages can be jointly optimized by the unified flow matching objective in a single Diffusion Transformer (DiT) (Peebles & Xie, 2023), allowing simultaneous generation and decompression of visual content without multiple separate models. During inference, the output of each stage is renoised by a corrective Gaussian noise, which helps maintain the continuity of the probability path between successive pyramid stages. Furthermore, we formulate video generation in an autoregressive manner that iteratively predicts the next video latent conditioned on previously generated history. Given the high redundancy in the full-resolution history, we curate a temporal pyramid sequence using progressively compressed, lower-resolution history as conditions, thereby further reducing the token count and improving training efficiency.

The combination of the spatial and temporal pyramids results in remarkable training efficiency for video generation. Compared to the commonly used full-sequence diffusion, our method significantly reduces the number of video tokens during training (e.g., $\leq$ 15,360 tokens versus 119,040 tokens for a 10-second, 241-frame video), thereby reducing both the computational resources required and the training time. Trained only on open-source datasets, our model generates high-quality 10-second videos at 768p resolution and 24 fps. The core contributions of this paper are summarized as follows:

We present pyramidal flow matching, a novel video generative modeling algorithm that incorporates both spatial and temporal pyramid representations. Utilizing this framework can significantly improve training efficiency while maintaining good video generation quality.

The proposed unified flow matching objective facilitates joint training of pyramid stages in a single Diffusion Transformer (DiT), avoiding the separate optimization of multiple models. The support for end-to-end training further enhances its simplicity and scalability.

We evaluate its effectiveness on VBench (Huang et al., 2024) and EvalCrafter (Liu et al., 2024), with highly competitive performance among video generative models trained on public datasets.

Related Work

Video Generative Models have seen rapid progress with autoregressive models (Yan et al., 2021; Hong et al., 2023; Kondratyuk et al., 2024; Jin et al., 2024) and diffusion models (Ho et al., 2022c; Blattmann et al., 2023b; a). A notable breakthrough is the high-fidelity video diffusion models (OpenAI, 2024; Kuaishou, 2024; Luma, 2024; Runway, 2024) by scaling up DiT pre-training (Peebles & Xie, 2023), but they induce significant training costs for long videos. An alternative line of research integrates diffusion models with autoregressive modeling (Chen et al., 2024a; Valevski et al., 2024) to natively support long video generation, but is still limited in context length and training efficiency. Our work advances both approaches in terms of efficiency from a compression perspective, featuring a spatially compressed pyramidal flow and a temporally compressed pyramidal history.

Image Pyramids (Adelson et al., 1984) have been studied extensively in visual representation learning (Lowe, 2004; Dalal & Triggs, 2005; Lin et al., 2017; Wang et al., 2020). For generative models, the idea is explored by cascaded diffusion models that first generate at low resolution and then perform super-resolution (Ho et al., 2022b; Saharia et al., 2022; Pernias et al., 2024; Teng et al., 2024). It is also extended to video by spatiotemporal super-resolution (Ho et al., 2022a; Singer et al., 2023). However, they require training several separate models, which prevents knowledge sharing. Possible unified modeling solutions for pyramids include hierarchical architectures (Rombach et al., 2022; Crowson et al., 2024; Hatamizadeh et al., 2024) or via next-token prediction (Yu et al., 2023; Tian et al., 2024), but involve architectural changes. Instead, we propose a simple flow matching objective that allows joint training of pyramids, thus facilitating efficient video generative modeling.

Method

This work proposes an efficient video generative modeling scheme named pyramidal flow matching. In the following text, we first extend the flow matching algorithm (Section 3.1) to an efficient spatial pyramid representation (Section 3.2). Then, a temporal pyramid design is proposed in Section 3.3 to further improve training efficiency. Lastly, practical implementations are discussed in Section 3.4.

Similar to diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020), flow generative models (Papamakarios et al., 2021; Song et al., 2021; Xu et al., 2022; Lipman et al., 2023; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023) aim to learn a velocity field ${\bm{v}}_{t}$ that maps random noise ${\bm{x}}_{0}\sim\mathcal{N}(\bm{0},{\bm{I}})$ to data samples ${\bm{x}}_{1}\sim q$ via an ordinary differential equation (ODE):

$$\frac{d{\bm{x}}_{t}}{dt}={\bm{v}}_{t}({\bm{x}}_{t}).$$

Recently, Lipman et al. (2023); Liu et al. (2023); Albergo & Vanden-Eijnden (2023) proposed the flow matching framework, which provides a simple simulation-free training objective for flow generative models by directly regressing the velocity ${\bm{v}}_{t}$ on a conditional vector field ${\bm{u}}_{t}(\cdot|{\bm{x}}_{1})$:

$$\mathbb{E}_{t,\,q({\bm{x}}_{1}),\,p_{t}({\bm{x}}_{t}|{\bm{x}}_{1})}\left\|{\bm{v}}_{t}({\bm{x}}_{t})-{\bm{u}}_{t}({\bm{x}}_{t}|{\bm{x}}_{1})\right\|^{2},$$

where ${\bm{u}}_{t}(\cdot|{\bm{x}}_{1})$ uniquely determines a conditional probability path $p_{t}(\cdot|{\bm{x}}_{1})$ toward data sample ${\bm{x}}_{1}$. An effective choice of the conditional probability path is linear interpolation of data and noise:

$${\bm{x}}_{t}=t\,{\bm{x}}_{1}+(1-t)\,{\bm{x}}_{0},$$

and ${\bm{u}}_{t}({\bm{x}}_{t}|{\bm{x}}_{1})={\bm{x}}_{1}-{\bm{x}}_{0}$. Notably, flow matching can be flexibly extended to interpolate between distributions other than standard Gaussians. This enables us to devise a flow matching algorithm that specializes in reducing the computational cost of video generative modeling.
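To make the linear-interpolation path concrete, the following is a minimal NumPy sketch of how a training pair and the regression loss could be formed, together with a plain Euler integrator for the ODE. The function names (`flow_matching_pair`, `flow_matching_loss`, `euler_sample`) and the `velocity_fn(x, t)` callable are hypothetical and introduced only for illustration; they are not taken from the authors' code.

```python
import numpy as np

def flow_matching_pair(x1, t, rng):
    """Return (x_t, target) for the linear-interpolation probability path."""
    x0 = rng.standard_normal(x1.shape)   # noise sample x_0 ~ N(0, I)
    xt = t * x1 + (1.0 - t) * x0         # point on the straight path at time t
    target = x1 - x0                     # conditional velocity u_t(x_t | x_1)
    return xt, target

def flow_matching_loss(velocity_fn, x1, t, rng):
    """Mean squared error between the predicted and the conditional velocity."""
    xt, target = flow_matching_pair(x1, t, rng)
    return float(np.mean((velocity_fn(xt, t) - target) ** 2))

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v_t(x) from t = 0 (noise) to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In practice `velocity_fn` would be a neural network and `t` would be sampled per example; the sketch keeps both fixed to expose the structure of the objective.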

3.2 Pyramidal Flow Matching

The main challenge in video generative modeling is the spatio-temporal complexity, and we address its spatial complexity first. According to the key observation in Fig. 1, the initial generation steps are usually very noisy and less informative, and thus may not need to operate at full resolution. This motivates us to study a spatially compressed pyramidal flow, illustrated in Fig. 2.

To alleviate redundant computation in early steps, we interpolate flow between data and compressed low-resolution noise. Let $\oplus$ denote the interpolation between latents of different resolutions, and let there be $K$ resolutions, each halving the previous one, then our flow may be expressed as:

where $\mathit{Down}(\cdot,\cdot)$ is a downsampling function. Since the interpolation concerns varying-dimensional ${\bm{x}}_{t}$, we decompose it as a piecewise flow (Yan et al., 2024) that divides $[0,1]$ into $K$ time windows, where each window interpolates between successive resolutions. For the $k$-th time window $[s_{k},e_{k}]$, let $t^{\prime}=(t-s_{k})/(e_{k}-s_{k})$ denote the rescaled timestep, then the flow within it follows:

where $\mathit{Up}(\cdot)$ is an upsampling function. This way, only the last stage is performed at full resolution, while most stages are performed at lower resolutions using less computation. Under a uniform stage partitioning, the spatial pyramid reduces the computational cost to a factor of nearly $1/K$. Below, we describe the instantiation of pyramidal flow for training and inference, respectively.
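One rough way to see where the "nearly $1/K$" figure comes from is to average the per-stage cost over the stages. The sketch below does this under assumptions that are not stated in the text: each stage receives an equal share of the sampling steps, the token count shrinks by $4\times$ per stage (halving height and width), and the per-step cost scales as a power of the token count. The function `relative_cost` is a hypothetical helper.

```python
def relative_cost(num_stages, cost_exponent=1):
    """Cost of a uniformly partitioned spatial pyramid relative to full resolution.

    Assumptions (illustrative only): every stage gets 1/num_stages of the
    steps, stage k works on 1/4**k of the full-resolution tokens, and the
    per-step cost scales as tokens ** cost_exponent
    (1 for token-wise layers, 2 for full self-attention).
    """
    per_stage = [(1 / 4 ** k) ** cost_exponent for k in range(num_stages)]
    return sum(per_stage) / num_stages

# With three stages the full-resolution stage dominates, so the total
# stays close to 1/3 of the full-resolution cost.
linear_cost = relative_cost(3, cost_exponent=1)     # (1 + 1/4 + 1/16) / 3
attention_cost = relative_cost(3, cost_exponent=2)  # (1 + 1/16 + 1/256) / 3
```

Because the lower-resolution terms decay geometrically, the sum is dominated by the single full-resolution stage, which is why the ratio approaches $1/K$.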

In the construction of pyramidal flow, our main concern is unified modeling of different stages, as previous works (Ho et al., 2022b; Pernias et al., 2024; Teng et al., 2024) all require training multiple models for separate generation and super-resolution, which hinders knowledge sharing.

To unify the objectives of generation and decompression/super-resolution, we curate the probability path by interpolating between different noise levels and resolutions. It starts with a more noisy and pixelated latent upsampled from a lower resolution, and yields cleaner and fine-grained results at a higher resolution, as illustrated in Fig. 2(a). Formally, the conditional probability path is defined by:

where $s_{k}<e_{k}$ , and the upsampling and downsampling functions for the clean ${\bm{x}}_{1}$ are well defined, e.g., by nearest or bilinear resampling. In addition, to enhance the straightness of the flow trajectory, we couple the sampling of its endpoints by enforcing the noise to be in the same direction. Namely, we first sample a noise ${\bm{n}}\sim\mathcal{N}(\bm{0},{\bm{I}})$ and then jointly compute the endpoints $(\hat{\bm{x}}_{e_{k}},\hat{\bm{x}}_{s_{k}})$ as:

Thereafter, we can regress the flow model ${\bm{v}}_{t}$ on the conditional vector field ${\bm{u}}_{t}(\hat{\bm{x}}_{t}|{\bm{x}}_{1})=\hat{\bm{x}}_{e_{k}}-\hat{\bm{x}}_{s_{k}}$ with the following flow matching objective to unify generation and decompression:
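The coupled endpoint sampling described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the endpoints are taken to be a clean latent scaled by the timestep plus shared noise scaled by one minus the timestep, `factor` denotes the downsampling factor of the current stage, average pooling stands in for $\mathit{Down}$ and nearest-neighbor repetition for $\mathit{Up}$, and latents are single-channel 2D arrays. All function names are hypothetical.

```python
import numpy as np

def down(x, factor):
    """Average-pool an (H, W) latent by an integer factor (assumed Down)."""
    if factor == 1:
        return x
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def up(x, factor=2):
    """Nearest-neighbor upsampling of an (H, W) latent (assumed Up)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def pyramid_stage_pair(x1, factor, s_k, e_k, t, rng):
    """Build (x_t, target) for one stage that works at resolution 1/factor."""
    clean_end = down(x1, factor)              # cleaner, fine-grained endpoint
    clean_start = up(down(x1, 2 * factor))    # pixelated endpoint from the coarser stage
    n = rng.standard_normal(clean_end.shape)  # one noise sample shared by both endpoints
    x_end = e_k * clean_end + (1.0 - e_k) * n
    x_start = s_k * clean_start + (1.0 - s_k) * n
    t_prime = (t - s_k) / (e_k - s_k)         # rescaled timestep in [0, 1]
    x_t = t_prime * x_end + (1.0 - t_prime) * x_start
    return x_t, x_end - x_start               # regression target for v_t
```

Sharing the noise `n` between the two endpoints is what keeps the path between them straight; drawing two independent noise samples would add an extra random component to the target.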

3.2.2 Inference with Renoising

During inference, standard sampling algorithms can be applied within each pyramid stage. However, we must carefully handle the jump points (Campbell et al., 2023) between successive pyramid stages of different resolutions to ensure continuity of the probability path.

To ensure continuity, we first upsample the previous low-resolution endpoint with nearest or bilinear resampling. The result, as a linear combination of the input, follows a Gaussian distribution:

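where ${\bm{\Sigma}}$ is a covariance matrix depending on the upsampling function. Comparing Eqs. 8 and 12, we find it possible to match the Gaussian distributions at each jump point by a linear transformation of the upsampled result. Specifically, the following rescaling and renoising scheme would suffice:

$$\hat{\bm{x}}_{s_{k}}=\frac{s_{k}}{e_{k+1}}\,\mathit{Up}(\hat{\bm{x}}_{e_{k+1}})+\alpha\,{\bm{n}}^{\prime},\quad{\bm{n}}^{\prime}\sim\mathcal{N}(\bm{0},{\bm{\Sigma}}^{\prime}),$$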

where the rescaling coefficient $s_{k}/e_{k+1}$ allows matching the means of these distributions, and the corrective noise ${\bm{n}}^{\prime}$ with a weight of $\alpha$ allows matching their covariance matrices.

To derive the corrective noise and its covariance, we consider the simplest scenario with nearest neighbor upsampling. In this case, ${\bm{\Sigma}}$ has a blockwise structure with non-zero elements only in the $4\times 4$ blocks along the diagonal (corresponding to those upsampled from the same pixel). Then, it can be inferred that the corrective noise’s covariance matrix ${\bm{\Sigma}}^{\prime}$ also has a blockwise structure:

$${\bm{\Sigma}}^{\prime}_{\mathit{block}}=\begin{pmatrix}1&\gamma&\gamma&\gamma\\ \gamma&1&\gamma&\gamma\\ \gamma&\gamma&1&\gamma\\ \gamma&\gamma&\gamma&1\end{pmatrix},$$

where ${\bm{\Sigma}}^{\prime}_{\mathit{block}}$ contains negative elements $\gamma\in[-1/3,0]$ (the lower bound $-1/3$ ensures that the covariance matrix is positive semidefinite) to reduce the correlation within each block, as illustrated in Fig. 2(b). Since it is desirable to maximally preserve the signals at each jump point, we opt to add a small amount of noise with $\gamma=-1/3$ such that it is most specialized for decorrelation. Substituting this into the above gives the update rule at jump points (see Appendix A for derivations):

$$\hat{\bm{x}}_{s_{k}}=\frac{1+s_{k}}{2}\,\mathit{Up}(\hat{\bm{x}}_{e_{k+1}})+\frac{\sqrt{3}(1-s_{k})}{2}\,{\bm{n}}^{\prime},\tag{15}$$

with $e_{k+1}=2s_{k}/(1+s_{k})$ . The resulting inference process with renoising is shown in Algorithm 1.
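A minimal NumPy sketch of one jump-point update follows, assuming nearest-neighbor $2\times$ upsampling of a single-channel 2D latent. The rescaling coefficient uses $s_{k}/e_{k+1}=(1+s_{k})/2$, which follows from $e_{k+1}=2s_{k}/(1+s_{k})$; the noise weight $\sqrt{3}(1-s_{k})/2$ is what covariance matching yields for $\gamma=-1/3$. The corrective noise is sampled by removing the mean of white noise inside each $2\times 2$ block and rescaling by $\sqrt{4/3}$, which gives unit variance and correlation $-1/3$ within a block. The function names are hypothetical.

```python
import numpy as np

def corrective_noise(shape, rng):
    """Sample n' with unit variance and correlation -1/3 inside each 2x2 block."""
    h, w = shape
    z = rng.standard_normal((h // 2, 2, w // 2, 2))
    z = z - z.mean(axis=(1, 3), keepdims=True)  # subtract each block's mean
    return np.sqrt(4.0 / 3.0) * z.reshape(h, w)

def renoise(x_prev_end, s_k, rng):
    """Map the end of the coarser stage to the start of the next, finer stage."""
    upsampled = x_prev_end.repeat(2, axis=0).repeat(2, axis=1)  # nearest neighbor
    noise = corrective_noise(upsampled.shape, rng)
    rescale = 0.5 * (1.0 + s_k)                 # equals s_k / e_{k+1}
    alpha = 0.5 * np.sqrt(3.0) * (1.0 - s_k)    # corrective noise weight
    return rescale * upsampled + alpha * noise
```

Since the corrective noise sums to zero within every block, the block average of the output is exactly the rescaled coarse latent; the added noise only breaks the correlation between pixels copied from the same source pixel.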

3.3 Pyramidal Temporal Condition

Beyond the spatial complexity addressed in the above sections, video presents another significant challenge due to its temporal length. The prevailing full-sequence diffusion methods generate all video frames simultaneously, restricting them to fixed-length generation (consistent with training). In contrast, the autoregressive video generation paradigm supports flexible-length generation during inference. Recent advancements (Chen et al., 2024a; Valevski et al., 2024) have also demonstrated its effectiveness in creating long-duration video content. However, their training is still severely limited by the computational complexity arising from the full-resolution long-history condition.

We observe that there is a high redundancy in full-resolution history conditions. For example, earlier frames in a video tend to provide high-level semantic conditions and are less related to appearance details. This motivates us to use compressed, lower-resolution history for autoregressive video generation. As shown in Fig. 3(a), we adopt a history condition of gradually increasing resolutions:

The above design significantly reduces the computational and memory overhead of video generative pre-training. Let there be $T$ history latents over $K$ lower resolutions, then most frames are computed at the lowest resolution of $1/2^{K}$, which reduces the number of training tokens to as little as $1/4^{K}$ of the full-resolution count. As a result, training efficiency is improved by up to $16^{K}/T$ times.

3.4 Practical Implementation

In this section, we show that the above pyramid designs can be easily implemented using standard Transformer architecture (Vaswani et al., 2017) and pipelines. This is crucial for efficient and scalable video generative pre-training based on existing acceleration frameworks.

Unlike previous methods (Ma et al., 2024) that utilize factorized spatial and temporal attention to reduce computational complexity, we directly employ full sequence attention, thanks to the far fewer tokens required by our pyramidal representation. Furthermore, blockwise causal attention is adopted in each transformer layer, ensuring that each token cannot attend to its subsequent frames. The ablation results in Section C.2 illustrate that such a causal attention design is crucial for autoregressive video generation. Another important design choice is the position encoding, as the pyramid designs introduce multiple spatial resolutions. As shown in Fig. 3(b), we extrapolate position encoding in the spatial pyramid for better fine-grained detail (Yang et al., 2024), while interpolating it in the temporal pyramid input to spatially align the history conditions.
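The blockwise causal attention pattern can be illustrated with a small mask-building sketch. It assumes the tokens of all frames are concatenated in temporal order and that frames may contribute different numbers of tokens (as happens with a lower-resolution history); `blockwise_causal_mask` is a hypothetical helper, not a function from the authors' code.

```python
import numpy as np

def blockwise_causal_mask(tokens_per_frame):
    """Boolean mask where entry [i, j] is True if token i may attend to token j.

    Tokens attend to every token in their own frame and in earlier frames,
    but never to tokens that belong to later frames.
    """
    frame_ids = np.repeat(np.arange(len(tokens_per_frame)), tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]

# Example: a 2-token frame followed by a 1-token frame.
mask = blockwise_causal_mask([2, 1])
```

Within a frame the mask is fully dense, so spatial attention remains bidirectional; causality is enforced only along the temporal axis.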

During training, different pyramidal stages are uniformly sampled in each update iteration. The autoregressive nature of our method inherently supports joint training of images and videos, since the first frame in a video acts as an image. We pack training samples with varying token counts together to form the length-balanced training batch following Patch n’ Pack (Dehghani et al., 2023). After training, our method natively possesses the capability of text-to-video and text-conditioned image-to-video generation. During inference sampling, the classifier-free guidance strategy can be employed to enhance temporal consistency and motion smoothness of the generated video.

Experiments

Training Dataset. Our model is trained on a mixed corpus of open-source image and video datasets. For images, we utilize a high-aesthetic subset of LAION-5B (Schuhmann et al., 2022), 11M from CC-12M (Changpinyo et al., 2021), 6.9M non-blurred subset of SA-1B (Kirillov et al., 2023), 4.4M from JourneyDB (Sun et al., 2023), and 14M publicly available synthetic data. For video data, we incorporate the WebVid-10M (Bain et al., 2021), OpenVid-1M (Nan et al., 2024), and another 1M high-resolution non-watermark video primarily from the Open-Sora Plan (PKU-Yuan Lab et al., 2024). After postprocessing, around 10M single-shot videos are available for training.

Evaluation Metrics. We utilize the VBench (Huang et al., 2024) and EvalCrafter (Liu et al., 2024) for quantitative performance evaluation. VBench is a comprehensive benchmark that includes 16 fine-grained dimensions to systematically measure both motion quality and semantic alignment of video generative models. EvalCrafter is another large-scale evaluation benchmark including around 17 objective metrics for assessing video generation capabilities. In addition to automated evaluation metrics, we also conducted a study with human participants to measure the human preference for our generated videos. The compared baselines are summarized in Appendix B.

Implementation Details. We utilize the prevailing MM-DiT architecture from SD3 Medium (Esser et al., 2024) as the base model, with 2B parameters in total. It employs sinusoidal position encoding (Vaswani et al., 2017) in the spatial dimensions. As for the temporal dimension, the 1D Rotary Position Embedding (RoPE) (Su et al., 2024) is added to support flexible training with different video durations. In addition, we use a 3D Variational Autoencoder (VAE) to compress videos both spatially and temporally with a downsampling ratio of $8\times 8\times 8$ . It shares a similar structure with MAGVIT-v2 (Yu et al., 2024) and is trained from scratch on the WebVid-10M dataset (Bain et al., 2021). The number of pyramid stages is set to 3 in all the experiments. Following Valevski et al. (2024), we add some corruptive noise of strength uniformly sampled from $[0,1/3]$ to the history pyramid conditions, which is critical for mitigating the autoregressive generation degradation.

4.2 Efficiency

The proposed pyramidal flow matching framework significantly reduces the computational and memory overhead in video generation training. Consider a video with $T$ frame latents, where each frame contains $N$ tokens at the original resolution. Full-sequence diffusion has $TN$ input tokens in DiT and requires $T^{2}N^{2}$ computations. In contrast, our method uses only approximately $TN/4^{K}$ tokens and $T^{2}N^{2}/16^{K}$ computations even for the final pyramid stage, which significantly improves the training efficiency. Specifically, it takes only 20.7k A100 GPU hours to train a 10s video generation model with 241 frames. Compared to existing models that require significant training resources, our method achieves superior video generation performance with far less computation. For example, Open-Sora 1.2 (Zheng et al., 2024) requires 4.8k Ascend and 37.8k H100 hours to train the generation of only 97 video frames, consuming more than twice the computation of our approach, yet producing videos of worse quality. At inference, our model takes just 56 seconds to create a 5-second, 384p video clip, which is comparable to full-sequence diffusion counterparts.
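The token and computation estimates in this paragraph can be evaluated directly. The sketch below simply codes the two approximate expressions; the example values $T=31$ and $N=3840$ are assumptions chosen so that $TN$ equals the 119,040-token figure quoted in the Introduction, and $K=3$ matches the number of pyramid stages used in the experiments. Both helper functions are hypothetical.

```python
def full_sequence_cost(num_latents, tokens_per_latent):
    """Tokens and quadratic attention cost of full-sequence diffusion."""
    tokens = num_latents * tokens_per_latent
    return tokens, tokens ** 2

def pyramid_cost(num_latents, tokens_per_latent, num_stages):
    """Approximate tokens and attention cost quoted for the pyramidal design."""
    tokens = num_latents * tokens_per_latent / 4 ** num_stages
    return tokens, tokens ** 2

# Assumed example values: T = 31 latents, N = 3840 tokens per latent, K = 3.
full_tokens, full_compute = full_sequence_cost(31, 3840)
pyr_tokens, pyr_compute = pyramid_cost(31, 3840, 3)
compute_ratio = full_compute / pyr_compute  # 16 ** K under these expressions
```

These are order-of-magnitude estimates in the spirit of the paragraph, not a measurement of actual training throughput.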

4.3 Main Results

Text-to-Video Generation. We first evaluate the text-to-video generation capability of the proposed method. For each text prompt, a 5-second, 121-frame video is generated for evaluation. The detailed quantitative results on VBench (Huang et al., 2024) and EvalCrafter (Liu et al., 2024) are summarized in Tables 1 and 2, respectively. Overall, our method surpasses all the compared open-source video generation baselines on these two benchmarks. Even with only publicly accessible video data in training, it achieves performance comparable to commercial competitors trained on much larger proprietary data, such as Kling (Kuaishou, 2024) and Gen-3 Alpha (Runway, 2024). In particular, we demonstrate exceptional performance on VBench in quality score (84.74 vs. 84.11 for Gen-3) and motion smoothness, which are crucial criteria reflecting the visual quality of generated videos. When evaluated on EvalCrafter, our method achieves better visual and motion quality scores than most compared methods. The semantic score is relatively lower than others, mainly because we use coarse-grained synthetic captions, which can be improved with more accurate video captioning. We also present some generated 5–10 second videos in Fig. 5, which show cinematic visual quality and validate the efficacy of pyramidal flow matching. More visualizations are provided in Section C.3.

User study. While quantitative evaluation scores reflect the video generation capability to some extent, they may not align with human preferences for visual quality. Hence, an additional user study is conducted to compare our performance with six baseline models, including CogVideoX (Yang et al., 2024) and Kling (Kuaishou, 2024). We utilized 50 prompts sampled from VBench and asked 20+ participants to rank each model according to the aesthetic quality, motion smoothness, and semantic alignment of the generated videos. As seen in Fig. 4, our method is preferred over open-source models such as Open-Sora and CogVideoX-2B especially in terms of motion smoothness. This is due to the substantial token savings achieved by pyramidal flow matching, enabling generation of 5-second (up to 10-second) 768p videos at 24 fps, while the baselines usually support video synthesis of similar length only at 8 fps. The detailed user study settings are presented in Appendix B.

Image-to-Video Generation. Thanks to the autoregressive property of our model and the causal attention design, the first frame of each video acts similarly to an image condition during training. Consequently, although our model is optimized solely for text-to-video generation, it naturally accommodates text-conditioned image-to-video generation during inference. Given an image and a textual prompt, it is able to animate the static input image by autoregressively predicting the future frames without further fine-tuning. In Fig. 6, we illustrate qualitative examples of its image-to-video generation performance, where each example consists of 120 newly synthesized frames spanning a duration of 5 seconds. As can be seen, our model successfully predicts reasonable subsequent motion, endowing the images with rich temporal dynamic information. More generated video examples are best viewed on our project page at https://pyramid-flow.github.io.

4.4 Ablation Study

In this section, we conduct ablation studies to validate the crucial components of our method, including the spatial pyramid in the denoising trajectory and the temporal pyramid in the history condition. Due to limited space, the ablations for other design choices are provided in Section C.2.

Effectiveness of spatial pyramid. In the generation trajectory of the proposed spatial pyramid, only the final stage operates at full resolution, which significantly reduces the number of tokens for most denoising timesteps. With the same computational resources, it can handle more samples per training batch, greatly enhancing the convergence rate. To validate its efficiency, we designed a baseline that employs the standard flow matching objective for training text-to-image generation in our early experiments. This baseline is optimized using the same training data, number of tokens per batch, hyperparameter configurations, and model architecture to ensure fairness. The performance comparison is illustrated in Fig. 7. It can be observed that the variant using pyramidal flow demonstrates superior visual quality and prompt-following capability. We further quantitatively evaluate the FID metric of these methods on the MS-COCO benchmark (Lin et al., 2014) by randomly sampling 3K prompts. The FID performance curve over training steps is presented on the right of Fig. 7. Compared to standard flow matching, the convergence rate of our method is significantly improved.

Effectiveness of temporal pyramid. As mentioned in Section 4.2, the temporal pyramid design can drastically reduce the computation demands compared to traditional full-sequence diffusion. Similar to the spatial pyramid, we also established a full-sequence diffusion baseline under the same experimental setting to investigate its training efficiency improvement. The qualitative comparison with the baseline is presented in Fig. 8, where the generated videos of our pyramidal variant demonstrate much better visual quality and temporal consistency under the same training steps. In contrast, the full-sequence diffusion baseline is far from convergence. It fails to produce coherent motion, leading to fragmented visual details and severe artifacts in the generated videos. This performance gap clearly highlights the training acceleration achieved by our method in video generative modeling.

Conclusion

This work presents an efficient video generative modeling framework based on pyramidal visual representations. In contrast to cascaded diffusion models that use separate models for different image pyramids to improve efficiency, we propose a unified pyramidal flow matching objective that simultaneously generates and decompresses visual content across pyramid stages with a single model, effectively facilitating knowledge sharing. Furthermore, a temporal pyramid design is introduced to reduce computational redundancy in the full-resolution history of a video. The proposed method is extensively evaluated on VBench and EvalCrafter, demonstrating its advantageous performance. All code and model weights will be open-sourced at https://pyramid-flow.github.io.

Acknowledgements. We thank Jinghan Li for assisting with the user study.

References

Appendix A Derivation

This section provides detailed derivation for Eq. 15 that handles jump points in the spatial pyramid.

To ensure continuity of the probability path across different stages of the spatial pyramid, we need to make sure that the endpoints have the same probability distribution. According to Eqs. 8 and 12, their distributions are already similar after a simple upsampling transformation:

Therefore, we can directly apply a linear transformation with a corrective Gaussian noise to match their distributions:

where the rescaling coefficient $s_{k}/e_{k+1}$ allows the means of these distributions to be matched, and $\alpha$ is the noise weight. Additionally, we need to match the covariance matrices of Eqs. 20 and 18:

To allow analysis of covariance matrices, e.g. ${\bm{\Sigma}}$ , we consider a simplest scenario with nearest neighbor upsampling. In this case, ${\bm{\Sigma}}$ has a blockwise structure with non-zero elements only in the $4\times 4$ blocks along the diagonal (corresponding to those upsampled from the same pixel). Then, it can be inferred that the corrective noise’s covariance matrix ${\bm{\Sigma}}^{\prime}$ has a similar blockwise structure:

where $\gamma$ is a negative value in $[-1/3,0]$ for the decorrelation (its lower bound $-1/3$ ensures that the covariance matrix is semidefinite). We further rewrite Eqs. 22 and 21 by considering the equality of their diagonal and non-diagonal elements, respectively:

Taking into account the timestep constraints $0<s_{k},e_{k+1}<1$ , they can be solved directly:

Intuitively, it is desirable to maximally preserve the signals at each jump point, which corresponds to minimizing the noise weight $\alpha$ . According to Eq. 25, this is equivalent to minimizing $\gamma$ . Substituting its minimum value $\gamma=-1/3$ into Eq. 25 yields:

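It is worth noting that $e_{k+1}>s_{k}$, indicating that the timestep is rolled back a bit when adding the corrective noise at each jump point. We can further obtain the renoising rule in Eq. 15:

$$\hat{\bm{x}}_{s_{k}}=\frac{1+s_{k}}{2}\,\mathit{Up}(\hat{\bm{x}}_{e_{k+1}})+\frac{\sqrt{3}(1-s_{k})}{2}\,{\bm{n}}^{\prime}.$$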
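The covariance matching behind this result can be checked numerically for a single $4\times 4$ block. The sketch assumes nearest-neighbor upsampling, so the four pixels copied from one coarse pixel are fully correlated (an all-ones block for ${\bm{\Sigma}}$), and it assumes the coarse endpoint carries isotropic noise of scale $1-e_{k+1}$ while the target start point carries isotropic noise of scale $1-s_{k}$. The helper `jump_point_covariance` is hypothetical.

```python
import numpy as np

def jump_point_covariance(s_k):
    """Covariance of one upsampled 2x2 block after rescaling and renoising."""
    gamma = -1.0 / 3.0
    e_next = 2.0 * s_k / (1.0 + s_k)            # timestep of the coarser endpoint
    alpha = np.sqrt(3.0) * (1.0 - s_k) / 2.0    # corrective noise weight
    sigma = np.ones((4, 4))                     # nearest-neighbor copies: fully correlated
    sigma_prime = np.full((4, 4), gamma) + (1.0 - gamma) * np.eye(4)
    scale = s_k / e_next                        # rescaling coefficient, equals (1 + s_k) / 2
    return scale ** 2 * (1.0 - e_next) ** 2 * sigma + alpha ** 2 * sigma_prime

# The result should be isotropic with variance (1 - s_k)^2.
s = 0.5
matches_target = np.allclose(jump_point_covariance(s), (1.0 - s) ** 2 * np.eye(4))
```

The off-diagonal terms cancel because the positive correlation introduced by upsampling, $(1-s_{k})^{2}/4$, is offset by the negative correlation of the corrective noise, $\gamma\alpha^{2}=-(1-s_{k})^{2}/4$.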

Appendix B Experimental Settings

Model Implementation Details. We adopt the MM-DiT architecture, based on SD3 Medium (Esser et al., 2024), which comprises 24 transformer layers and a total of 2B parameters. The weights of the MM-DiT are initialized from SD3 Medium. Following the more recent FLUX.1 (Black Forest Labs, 2024), both T5 (Raffel et al., 2020) and CLIP (Radford et al., 2021) encoders are employed for prompt embedding. To address the redundancy in video data, we have designed a 3D VAE that compresses videos both spatially and temporally into a latent space. The architecture of this VAE is similar to MAGVIT-v2 (Yu et al., 2024), employing 3D causal convolution to ensure that each frame depends only on the preceding frames. It features an asymmetric encoder-decoder with Kullback-Leibler (KL) regularization applied to the latents. Overall, the 3D VAE achieves a compression rate of $8\times 8\times 8$ from pixels to the latent. It is trained from scratch on WebVid-10M and 6.9M SAM images. To support the tokenization of very long videos, we scatter them across multiple GPUs to distribute computation, as in CogVideoX (Yang et al., 2024).

Training Procedure. Our model undergoes a three-stage training procedure using 128 NVIDIA A100 GPUs. (1) Image Training. In the first stage, we utilize a pure image dataset that includes 180M images from LAION-5B (Schuhmann et al., 2022), 11M from CC-12M (Changpinyo et al., 2021), 6.9M non-blurred images from SA-1B (Kirillov et al., 2023), and 4.4M from JourneyDB (Sun et al., 2023). We keep each image’s original aspect ratio and arrange the images into different buckets. The model is trained for a total of 50,000 steps, requiring approximately 1,536 A100 GPU hours. After this stage, the model has learned the dependencies between visual pixels, which facilitates the convergence of subsequent video training. (2) Low-Resolution Video Training. For this stage, we employ WebVid-10M (Bain et al., 2021), OpenVid-1M (Nan et al., 2024), and another 1M non-watermark videos from the Open-Sora Plan (PKU-Yuan Lab et al., 2024). We also leverage Video-LLaMA2 (Cheng et al., 2024), a state-of-the-art video understanding model, to recaption each video sample. The image data from stage 1 is also utilized at a proportion of 12.5% in each batch. We first train the model for 80,000 steps on 2-second video generation, followed by an additional 120,000 steps on 5-second videos. In total, this stage takes about 11,520 A100 GPU hours. (3) High-Resolution Video Training. The final stage employs the same strategy to continue fine-tuning the model on the aforementioned high-resolution video dataset of varying durations (5–10s). This stage consumes approximately 7,680 A100 GPU hours for 50,000 steps.

Hyperparameter Settings. The detailed training hyperparameter settings for each optimization stage are reported in Table 3.

Baseline Methods. For VBench (Huang et al., 2024), we compare with nine baseline methods, including Open-Sora Plan V1.1 (PKU-Yuan Lab et al., 2024), Open-Sora 1.2 (Zheng et al., 2024), VideoCrafter2 (Chen et al., 2024b), Gen-2 (Runway, 2023), Pika 1.0 (Pika, 2023), T2V-Turbo (Li et al., 2024), CogVideoX (Yang et al., 2024), Kling (Kuaishou, 2024), and Gen-3 Alpha (Runway, 2024). Among them, Open-Sora Plan, Open-Sora, CogVideoX, Kling and Gen-3 Alpha can generate long videos. For EvalCrafter (Liu et al., 2024), our model is compared to six baselines, including ModelScope (Wang et al., 2023a), Show-1 (Zhang et al., 2024), LaVie (Wang et al., 2023b), VideoCrafter2 (Chen et al., 2024b), Pika 1.0 (Pika, 2023), and Gen-2 (Runway, 2023). The above models are all based on full-sequence diffusion, while our method combines the merits of autoregressive generation and flow generative models to achieve better training efficiency for video generation.

User Study. To complement the quantitative evaluation in the main paper, we conduct a rigorous user study to collect human preferences for these generative models. To this end, we sample 50 prompts from the VBench prompt list and, for each prompt, randomly sample one generated video from each baseline model. In total, six baseline models are considered, including Open-Sora Plan V1.1 (PKU-Yuan Lab et al., 2024), Open-Sora 1.2 (Zheng et al., 2024), Pika 1.0 (Pika, 2023), CogVideoX-2B and 5B (Yang et al., 2024), and Kling (Kuaishou, 2024). We then pair these results with our generated video and ask the participants to indicate their preference along three dimensions: aesthetic quality, motion smoothness, and semantic alignment, each of which represents a crucial aspect of video quality. The interface for the user study is exemplified in Fig. 9, where the user is shown a prompt and two generated videos (with identifying information, such as a watermark indicating which model a video belongs to, cropped out) and chooses which model is better in each of the three dimensions. We distribute the user study to more than 20 participants and collect a total of 1411 valid preference choices, ensuring its effectiveness. The results of this user study are presented in Fig. 4, where our model shows very competitive performance among the compared baselines.
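As a rough sketch of how such pairwise choices could be aggregated, the snippet below computes a per-baseline, per-dimension win rate. The record layout, the function name `win_rates`, the placeholder name `baseline_A`, and the toy votes are all hypothetical; they are not the study's data, and the aggregation actually used for Fig. 4 may differ.

```python
from collections import Counter

DIMENSIONS = ("aesthetic quality", "motion smoothness", "semantic alignment")


def win_rates(votes):
    """Fraction of choices won by our model, per (baseline, dimension).

    votes: iterable of (baseline, dimension, winner) tuples, where
    winner is either "ours" or "baseline".
    """
    wins, totals = Counter(), Counter()
    for baseline, dimension, winner in votes:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        totals[(baseline, dimension)] += 1
        if winner == "ours":
            wins[(baseline, dimension)] += 1
    return {key: wins[key] / totals[key] for key in totals}


# Made-up votes purely to exercise the function.
toy_votes = [
    ("baseline_A", "motion smoothness", "ours"),
    ("baseline_A", "motion smoothness", "ours"),
    ("baseline_A", "motion smoothness", "baseline"),
    ("baseline_A", "aesthetic quality", "baseline"),
]
rates = win_rates(toy_votes)
print(rates)
```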

Appendix C Additional Results

This section provides the full results on VBench (Huang et al., 2024) and EvalCrafter (Liu et al., 2024) as a supplement to the performance comparison in the experiments section of the main paper. The evaluation of our model is performed using 5-second 768p videos generated at 24 fps.

VBench (Huang et al., 2024). The full experimental results on VBench are shown in Table 4. As can be observed, our model achieves leading or highly competitive results among open-source and commercial competitors, especially for the metrics related to motion quality. For example, the dynamic degree metric of our model ranks 2nd among all models at 64.63, validating the effectiveness of our generative model in learning temporal dynamics. For the rest of the metrics, our results are also generally superior to the open-source Open-Sora Plan V1.1 (PKU-Yuan Lab et al., 2024) and Open-Sora 1.2 (Zheng et al., 2024), at a significantly lower training cost, as mentioned earlier. We also note that half of our results even outperform the recent CogVideoX-5B (Yang et al., 2024), which is based on a larger DiT model, demonstrating the modeling capacity of our approach. On the other hand, our model performs relatively poorly on metrics such as color and appearance style, which are more related to image generation capabilities and finer-grained prompt following. This is largely due to our video captioning procedure based on video LLMs, which tends to produce coarse-grained captions, thus dampening these abilities. Nevertheless, thanks to our autoregressive generation framework, which decomposes video generation into first-frame generation and subsequent-frame generation, these image quality issues can be addressed separately with additional well-captioned image data in future training stages. Similarly, due to the SD3-Medium weight initialization, which is known to struggle with human body structure, our method achieves a relatively low score in human action. This could be addressed by switching to other base models or training from scratch.

EvalCrafter (Liu et al., 2024). The raw metrics on EvalCrafter are provided in Table 5. Overall, our model delivers highly competitive performance on the majority of metrics, outperforming many previous open-source and closed-source models. In particular, the motion AC score of our method, which is relevant to the temporal motion quality, ranks 2nd among all methods, supporting the capacity of our pyramid designs to learn complex spatiotemporal patterns in video. Our method also performs strongly on several other metrics related to semantic alignment, including BLIP-BLEU and CLIP score. Placing in the top two on both metrics among the models compared, including the closed-source Gen-2 (Runway, 2023), confirms the advantages of our model in text-to-video semantic alignment. The only metric where our model performs poorly is face consistency, which is due to the temporal pyramid design adopted for compressing the history condition. We view this as an issue that can potentially be addressed by more sophisticated temporal compression schemes.

C.2 Ablation Study

In this section, we conduct additional ablation studies of two important design details in our proposed pyramidal flow matching, including the corrective noise added during inference of the spatial pyramid and the blockwise causal attention used for autoregressive video generation.

Role of corrective noise. To study its efficacy in the spatial pyramid, we curate a baseline method that performs inference without adding this corrective Gaussian noise. The detailed comparative results of our method against this variant are shown in Fig. 10. While the baseline method has a correct global structure, it fails to produce a fine-grained, high-resolution image with rich details and instead produces a blurred image that suffers from block-like artifacts (better observed when zooming in). This is because applying the upsampling function at the jump points between different pyramid stages of varying resolutions results in excessive correlation between spatially adjacent latent values. In comparison, our generated images have rich details and vivid colors, confirming that the adopted corrective renoising scheme effectively addresses this artifact problem in the spatial pyramid.
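The correlation argument can be checked numerically with a toy sketch. It is not the renoising rule used by our method: the snippet assumes nearest-neighbor 2x upsampling and, as a stand-in for the corrective noise, simply adds independent unit-variance Gaussian noise. All function names and the grid size are hypothetical.

```python
import math
import random


def upsample_nearest_2x(grid):
    """Repeat every value twice along both axes."""
    return [[v for v in row for _ in range(2)] for row in grid for _ in range(2)]


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def adjacent_correlation(grid):
    """Correlation between horizontal neighbors inside each 2x2 block."""
    left = [row[2 * j] for row in grid for j in range(len(row) // 2)]
    right = [row[2 * j + 1] for row in grid for j in range(len(row) // 2)]
    return pearson(left, right)


rng = random.Random(0)
low_res = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(32)]

upsampled = upsample_nearest_2x(low_res)
corr_before = adjacent_correlation(upsampled)

# Stand-in for corrective noise: independent Gaussian noise per position.
renoised = [[v + rng.gauss(0.0, 1.0) for v in row] for row in upsampled]
corr_after = adjacent_correlation(renoised)

print(corr_before, corr_after)
```

Neighbors inside each upsampled block are identical, so their correlation is essentially 1, while the independently perturbed grid shows a clearly lower value. This mirrors, in a simplified setting, why upsampling without added noise leads to block-like artifacts.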

Effectiveness of causal attention. In Fig. 11, we study the effect of blockwise causal attention by comparing it to the bidirectional attention used in full-sequence diffusion. While an intuitive understanding might be that bidirectional attention promotes information exchange and increases model capacity, it is understudied for autoregressive video generation. In an early experiment, we trained a baseline model using bidirectional attention across different latent frames, the results of which are visualized in Fig. 11. As can be seen from the sampled keyframes of the 1-second videos, this model suffers from a lack of temporal coherence as the subject in the generated video is constantly changing in shape and color. Meanwhile, our model shows good temporal coherence with reasonable motion. We infer that this is because the history condition in bidirectional attention is influenced by the ongoing generation and thus deviates, whereas the history condition in causal attention is fixed, serving as a predetermined condition and stabilizing the autoregressive generative process.
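The difference between the two attention patterns can be sketched as a pair of boolean masks. This is a minimal illustration rather than our implementation: the function name `blockwise_causal_mask` is hypothetical, and it assumes that tokens attend bidirectionally within their own latent frame and causally to all earlier frames.

```python
def blockwise_causal_mask(tokens_per_frame):
    """mask[q][k] is True if query token q may attend to key token k.

    tokens_per_frame: number of tokens in each latent frame, in order.
    Assumption: full attention inside a frame, causal across frames.
    """
    frame_of = [f for f, n in enumerate(tokens_per_frame) for _ in range(n)]
    total = len(frame_of)
    return [
        [frame_of[k] <= frame_of[q] for k in range(total)]
        for q in range(total)
    ]


def bidirectional_mask(tokens_per_frame):
    """Every token attends to every token, as in full-sequence diffusion."""
    total = sum(tokens_per_frame)
    return [[True] * total for _ in range(total)]


# Two latent frames with two tokens each.
causal = blockwise_causal_mask([2, 2])
full = bidirectional_mask([2, 2])

for row in causal:
    print(["x" if allowed else "." for allowed in row])
```

In the causal mask, tokens of the first frame never see the second frame, so the history stays fixed while later frames are generated. In the bidirectional mask, the history tokens also attend to the frames still being generated.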

C.3 Visualization

This section presents additional qualitative results for our text-to-video generation in comparison to the recent leading models including Gen-3 Alpha (Runway, 2024), Kling (Kuaishou, 2024) and CogVideoX (Yang et al., 2024). The uniformly sampled frames from the generated videos are shown in Figs. 12 and 13, in which our videos are generated at 5s, 768p, 24fps. Overall, we observe that despite being trained only on publicly available data and using a small computational budget, our model yields highly competitive visual aesthetics and motion quality among the baselines.

Specifically, the results highlight the following characteristics of our model: (1) Through generative pre-training, our model is capable of generating videos of cinematic quality and reasonable content. For example, in Fig. 12(a), our generated video shows a mushroom cloud resulting from “a massive explosion” taking place on “the surface of the earth”, creating a sci-fi movie atmosphere. However, the current model is not fully faithful to some prompts such as the “salt desert” in Fig. 12(b), which could be addressed by curating more sophisticated caption data. (2) Although our model has only 2B parameters and is initialized from SD3-Medium (Esser et al., 2024), it clearly outperforms CogVideoX-2B, which has the same model size and uses additional training data, and is even comparable to the 5B full version in some aspects. For example, in Figs. 13(a) and 13(b), only our model and CogVideoX-5B are capable of generating reasonable sea waves according to the input prompt, while CogVideoX-2B merely depicts an almost static sea surface. This is largely attributed to the improved training efficiency of our proposed pyramidal flow matching. Overall, these results validate the effectiveness of our approach in modeling complex spatiotemporal patterns through the spatial and temporal pyramid designs. Our generated videos are best viewed at https://pyramid-flow.github.io.

Appendix D Limitations

Our method only supports autoregressive generation and cannot be extended to keyframe interpolation or video interpolation. In addition, we noticed that the temporal pyramid design used to improve training efficiency can sometimes lead to subtle subject inconsistency, especially over the long term. While this is not a prevalent problem, we believe that developing more sophisticated temporal compression methods is critical to the broader applicability of our video generative model.

There are also several issues related to the training data. Since we did not include a prompt rewriting procedure in the data curation, the experimental results are focused on relatively short prompts. Also, due to the data filtering procedure, our model did not learn scene transitions during training. This may be overcome by introducing an additional model as the scene director (Lin et al., 2024).