Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

cs.CV

Introduction

Driven by advances in generative image modeling with diffusion models, there has been significant recent progress on generative video models both in research and real-world applications. Broadly, these models are either trained from scratch or finetuned (partially or fully) from pretrained image models with additional temporal layers inserted. Training is often carried out on a mix of image and video datasets.

While research around improvements in video modeling has primarily focused on the exact arrangement of the spatial and temporal layers, none of the aforementioned works investigate the influence of data selection. This is surprising, especially since the significant impact of the training data distribution on generative models is undisputed. Moreover, for generative image modeling, it is known that pretraining on a large and diverse dataset and finetuning on a smaller but higher-quality dataset significantly improves the performance. Since many previous approaches to video modeling have successfully drawn on techniques from the image domain, it is noteworthy that the effect of data and training strategies, i.e., the separation of video pretraining at lower resolutions and high-quality finetuning, has yet to be studied. This work directly addresses these previously uncharted territories.

We believe that the significant contribution of data selection is heavily underrepresented in today’s video research landscape despite being well-recognized among practitioners when training video models at scale. Thus, in contrast to previous works, we draw on simple latent video diffusion baselines for which we fix architecture and training scheme and assess the effect of data curation. To this end, we first identify three different video training stages that we find crucial for good performance: text-to-image pretraining, video pretraining on a large dataset at low resolution, and high-resolution video finetuning on a much smaller dataset with higher-quality videos. Borrowing from large-scale image model training, we introduce a systematic approach to curate video data at scale and present an empirical study on the effect of data curation during video pretraining. Our main findings imply that pretraining on well-curated datasets leads to significant performance improvements that persist after high-quality finetuning.

Drawing on these findings, we apply our proposed curation scheme to a large video dataset comprising roughly 600 million samples and train a strong pretrained text-to-video base model, which provides a general motion representation. We exploit this and finetune the base model on a smaller, high-quality dataset for high-resolution downstream tasks such as text-to-video (see Figure 1, top row) and image-to-video, where we predict a sequence of frames from a single conditioning image (see Figure 1, mid rows). Human preference studies reveal that the resulting model outperforms state-of-the-art image-to-video models.

Furthermore, we also demonstrate that our model provides a strong multi-view prior and can serve as a base to finetune a multi-view diffusion model that generates multiple consistent views of an object in a feedforward manner and outperforms specialized novel view synthesis methods such as Zero123XL and SyncDreamer. Finally, we demonstrate that our model allows for explicit motion control by specifically prompting the temporal layers with motion cues and also via training LoRA modules on datasets resembling specific motions only, which can be efficiently plugged into the model. To summarize, our core contributions are threefold: (i) We present a systematic data curation workflow to turn a large uncurated video collection into a quality dataset for generative video modeling. Using this workflow, we (ii) train state-of-the-art text-to-video and image-to-video models, outperforming all prior models. Finally, we (iii) probe the strong prior of motion and 3D understanding in our models by conducting domain-specific experiments. Specifically, we provide evidence that pretrained video diffusion models can be turned into strong multi-view generators, which may help overcome the data scarcity typically observed in the 3D domain.

Background

Most recent works on video generation rely on diffusion models to jointly synthesize multiple consistent frames from text- or image-conditioning. Diffusion models implement an iterative refinement process by learning to gradually denoise a sample from a normal distribution and have been successfully applied to high-resolution text-to-image and video synthesis.

In this work, we follow this paradigm and train a latent video diffusion model on our video dataset. We provide a brief overview of related works which utilize latent video diffusion models (Video-LDMs) in the following paragraph; a full discussion that includes approaches using GANs and autoregressive models can be found in App. B. An introduction to diffusion models can be found in App. D.

Video-LDMs train the main generative model in a latent space of reduced computational complexity. Most related works make use of a pretrained text-to-image model and insert temporal mixing layers of various forms into the pretrained architecture. Ge et al. additionally rely on temporally correlated noise to increase temporal consistency and ease the learning task. In this work, we follow the architecture proposed in Blattmann et al. and insert temporal convolution and attention layers after every spatial convolution and attention layer. In contrast to works that only train temporal layers or are completely training-free, we finetune the full model. For text-to-video synthesis in particular, most works directly condition the model on a text prompt or make use of an additional text-to-image prior.

In our work, we follow the former approach and show that the resulting model is a strong general motion prior, which can easily be finetuned into an image-to-video or multi-view synthesis model. Additionally, we introduce micro-conditioning on frame rate. We also employ the EDM-framework and significantly shift the noise schedule towards higher noise values, which we find to be essential for high-resolution finetuning. See Section 4 for a detailed discussion of the latter.

Pretraining on large-scale datasets is an essential ingredient for powerful models in several tasks such as discriminative text-image and language modeling. By leveraging efficient language-image representations such as CLIP, data curation has similarly been successfully applied for generative image modeling. However, discussions on such data curation strategies have largely been missing in the video generation literature, and processing and filtering strategies have been introduced in an ad-hoc manner. Among the publicly accessible video datasets, the WebVid-10M dataset has been a popular choice despite being watermarked and suboptimal in size. Additionally, WebVid-10M is often used in combination with image data to enable joint image-video training. However, this amplifies the difficulty of separating the effects of image and video data on the final model. To address these shortcomings, this work presents a systematic study of methods for video data curation and further introduces a general three-stage training strategy for generative video models, producing a state-of-the-art model.

Curating Data for HQ Video Synthesis

In this section, we introduce a general strategy to train a state-of-the-art video diffusion model on large datasets of videos. To this end, we (i) introduce data processing and curation methods, for which we systematically analyze the impact on the quality of the final model in Section 3.3 and Section 3.4, and (ii) identify three different training regimes for generative video modeling. In particular, these regimes consist of:

Stage I: image pretraining, i.e. a 2D text-to-image diffusion model.

Stage II: video pretraining, which trains on large amounts of videos.

Stage III: video finetuning, which refines the model on a small subset of high-quality videos at higher resolution.

We study the importance of each regime separately in Sections 3.2, 3.3 and 3.4.

We collect an initial dataset of long videos which forms the base data for our video pretraining stage. To avoid cuts and fades leaking into synthesized videos, we apply a cut detection pipeline (https://github.com/Breakthrough/PySceneDetect) in a cascaded manner at three different FPS levels. Figure 2, left, provides evidence for the need for cut detection: after applying our cut-detection pipeline, we obtain a significantly higher number ($\sim 4\times$) of clips, indicating that many video clips in the unprocessed dataset contain cuts beyond those obtained from metadata.

Next, we annotate each clip with three different synthetic captioning methods: First, we use the image captioner CoCa to annotate the mid-frame of each clip. Second, we use V-BLIP to obtain a video-based caption. Finally, we generate a third description of the clip via an LLM-based summarization of the first two captions.

The resulting initial dataset, which we dub Large Video Dataset (LVD), consists of 580M annotated video clip pairs, forming 212 years of content.
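As a quick consistency check on these figures, the average clip length implied by 580M clips and 212 years of content can be computed directly. The sketch below assumes a 365.25-day year; the function name is ours and not part of any released code.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def mean_clip_seconds(total_years: float, num_clips: float) -> float:
    """Average clip duration in seconds implied by total duration and clip count."""
    if num_clips <= 0:
        raise ValueError("num_clips must be positive")
    return total_years * SECONDS_PER_YEAR / num_clips


avg = mean_clip_seconds(total_years=212, num_clips=580e6)
print(f"{avg:.1f} s per clip on average")
```

Under this assumption the average clip is roughly 11.5 seconds long.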

However, further investigation reveals that the resulting dataset contains examples that can be expected to degrade the performance of our final video model, such as clips with little motion, excessive text presence, or generally low aesthetic value. We therefore additionally annotate our dataset with dense optical flow, which we calculate at 2 FPS and with which we filter out static scenes by removing any videos whose average optical flow magnitude is below a certain threshold. Indeed, when considering the motion distribution of LVD (see Figure 2, right) via optical flow scores, we identify a subset of close-to-static clips therein.
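The motion filter described above can be sketched as follows. This is a minimal illustration, not our pipeline: it assumes the dense flow of each clip has already been estimated and flattened into a list of (dx, dy) vectors, and the threshold value is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Flow = List[Tuple[float, float]]  # flattened per-pixel (dx, dy) flow vectors of one clip


def mean_flow_magnitude(flow: Flow) -> float:
    """Average Euclidean norm of the flow vectors of one clip."""
    if not flow:
        return 0.0
    return sum(math.hypot(dx, dy) for dx, dy in flow) / len(flow)


def filter_static_clips(flows: Dict[str, Flow], min_magnitude: float) -> List[str]:
    """Keep only the clip ids whose mean flow magnitude reaches the threshold."""
    return [cid for cid, flow in flows.items()
            if mean_flow_magnitude(flow) >= min_magnitude]


clips = {
    "static": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "moving": [(3.0, 4.0), (0.0, 2.0), (1.0, 0.0)],
}
kept = filter_static_clips(clips, min_magnitude=1.0)  # hypothetical threshold
print(kept)
```

In this toy example only the clip with noticeable motion survives the filter.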

Moreover, we apply optical character recognition to weed out clips containing large amounts of written text. Lastly, we annotate the first, middle, and last frames of each clip with CLIP embeddings from which we calculate aesthetics scores as well as text-image similarities. Statistics of our dataset, including the total size and average duration of clips, are provided in Tab. 1.

3.2 Stage I: Image Pretraining

We consider image pretraining as the first stage in our training pipeline. Thus, in line with concurrent work on video models, we ground our initial model on a pretrained image diffusion model, namely Stable Diffusion 2.1, to equip it with a strong visual representation.

To analyze the effects of image pretraining, we train and compare two identical video models as detailed in App. D on a 10M subset of LVD; one with and one without pretrained spatial weights. We compare these models using a human preference study (see App. E for details) in Figure 3(a), which clearly shows that the image-pretrained model is preferred in both quality and prompt-following.

3.3 Stage II: Curating a Video Pretraining Dataset

A systematic approach to video data curation. For multimodal image modeling, data curation is a key element of many powerful discriminative and generative models. However, since there are no equally powerful off-the-shelf representations available in the video domain to filter out unwanted examples, we rely on human preferences as a signal to create a suitable pretraining dataset. Specifically, we curate subsets of LVD using different methods described below and then consider the human-preference-based ranking of latent video diffusion models trained on these datasets.

More specifically, for each type of annotation introduced in Section 3.1 (i.e., CLIP scores, aesthetic scores, OCR detection rates, synthetic captions, optical flow scores), we start from an unfiltered, randomly sampled 9.8M-sized subset of LVD, LVD-10M, and systematically remove the bottom 12.5, 25 and 50% of examples. Note that for the synthetic captions, we cannot filter in this sense. Instead, we assess Elo rankings for the different captioning methods from Section 3.1. To keep the number of total subsets tractable, we apply this scheme separately to each type of annotation. We train models with the same training hyperparameters on each of these filtered subsets and compare the results of all models within the same class of annotation with an Elo ranking for human preference votes. Based on these votes, we consequently select the best-performing filtering threshold for each annotation type. The details of this study are presented and discussed in App. E. Applying this filtering approach to LVD results in a final pretraining dataset of 152M training examples, which we refer to as LVD-F, cf. Tab. 1.
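The thresholding scheme above amounts to ranking examples by one annotation score and discarding a fixed fraction from the bottom. A minimal sketch follows; it assumes that a higher score is better (for annotations where lower is better, such as OCR detection rates, the score would be negated first), and the helper names are ours.

```python
from typing import Dict, List, Sequence


def drop_bottom_fraction(scores: Dict[str, float], fraction: float) -> List[str]:
    """Return the ids that remain after removing the lowest-scoring fraction."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must lie in [0, 1)")
    ranked = sorted(scores, key=scores.get)      # ascending by score
    num_dropped = int(len(ranked) * fraction)    # floor of the requested share
    return ranked[num_dropped:]


def build_subsets(scores: Dict[str, float],
                  fractions: Sequence[float] = (0.125, 0.25, 0.5)) -> Dict[float, List[str]]:
    """One filtered subset per threshold, all derived from the same annotation."""
    return {f: drop_bottom_fraction(scores, f) for f in fractions}


toy_scores = {f"clip{i}": float(i) for i in range(8)}
subsets = build_subsets(toy_scores)
for fraction, ids in subsets.items():
    print(fraction, len(ids))
```

Each resulting subset would then be used to train one model, and the models within an annotation type are compared via human preference votes.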

Curated training data improves performance. In this section, we demonstrate that the data curation approach described above improves the training of our video diffusion models. To show this, we apply the filtering strategy described above to LVD-10M and obtain a four times smaller subset, LVD-10M-F. Next, we use it to train a baseline model that follows our standard architecture and training schedule and evaluate the preference scores for visual quality and prompt-video alignment compared to a model trained on uncurated LVD-10M.

We visualize the results in Figure 3(b), where we can see the benefits of filtering: in both categories, the model trained on the much smaller LVD-10M-F is preferred. To further show the efficacy of our curation approach, we compare the model trained on LVD-10M-F with similar video models trained on WebVid-10M, which is the most recognized research-licensed dataset, and InternVid-10M, which is specifically filtered for high aesthetics. Although LVD-10M-F is also four times smaller than these datasets, the corresponding model is preferred by human evaluators in both spatiotemporal quality and prompt alignment, as shown in Figures 4(a) and 4(b).

Data curation helps at scale. To verify that our data curation strategy from above also works on larger, more practically relevant datasets, we repeat the experiment above and train a video diffusion model on a filtered subset with 50M examples and a non-curated one of the same size. We conduct a human preference study and summarize the results of this study in Figure 4(c), where we can see that the advantages of data curation also come into play with larger amounts of data. Finally, we show that dataset size is also a crucial factor when training on curated data in Figure 4(d), where a model trained on 50M curated samples is superior to a model trained on LVD-10M-F for the same number of steps.

3.4 Stage III: High-Quality Finetuning

In the previous section, we demonstrated the beneficial effects of systematic data curation for video pretraining. However, since we are primarily interested in optimizing the performance after video finetuning, we now investigate how these differences after Stage II translate to the final performance after Stage III. Here, we draw on training techniques from latent image diffusion modeling and increase the resolution of the training examples. Moreover, we use a small finetuning dataset comprising 250K pre-captioned video clips of high visual fidelity.

To analyze the influence of video pretraining on this last stage, we finetune three identical models, which only differ in their initialization. We initialize the weights of the first with a pretrained image model and skip video pretraining, a common choice among many recent video modeling approaches. The remaining two models are initialized with the weights of the latent video models from the previous section, specifically, the ones trained on 50M curated and uncurated video clips. We finetune all models for 50K steps and assess human preference rankings early during finetuning (10K steps) and at the end to measure how performance differences progress in the course of finetuning. We show the obtained results in Figure 4(e), where we plot the Elo improvements of user preference relative to the model ranked last, which is the one initialized from an image model. Moreover, the finetuning resumed from curated pretrained weights ranks consistently higher than the one initialized from video weights after uncurated training.

Given these results, we conclude that i) the separation of video model training into video pretraining and video finetuning is beneficial for the final model performance after finetuning and that ii) video pretraining should ideally occur on a large-scale, curated dataset, since performance differences after pretraining persist after finetuning.
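For readers unfamiliar with Elo scores, the sketch below shows how pairwise preference votes can be turned into ratings and reported relative to the lowest-ranked model. It uses the textbook Elo update; the K-factor, the starting rating, the model names, and the votes are illustrative assumptions rather than the settings or data of our study.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_from_votes(votes: Iterable[Tuple[str, str]],
                   k: float = 32.0, start: float = 1000.0) -> Dict[str, float]:
    """Process (winner, loser) votes in order and return the final ratings."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain   # winner gains what the loser gives up
        ratings[loser] = r_l - gain
    return ratings


# Hypothetical votes between three differently initialized models.
votes = ([("curated", "image_init")] * 3
         + [("uncurated", "image_init")] * 2
         + [("curated", "uncurated")] * 2)
ratings = elo_from_votes(votes)
worst = min(ratings.values())
relative = {name: rating - worst for name, rating in ratings.items()}
print(relative)
```

Note that sequential Elo updates depend on the order in which votes are processed, so in practice votes are often shuffled or the ratings averaged over several orderings.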

Training Video Models at Scale

In this section, we borrow takeaways from Section 3 and present results of training state-of-the-art video models at scale. We first use the optimal data strategy inferred from ablations to train a powerful base model at $320\times 576$ in Section D.2. We then perform finetuning to yield several strong state-of-the-art models for different tasks such as text-to-video in Section 4.2, image-to-video in Section 4.3 and frame interpolation in Section 4.4. Finally, we demonstrate that our video pretraining can serve as a strong implicit 3D prior by tuning our image-to-video models on multi-view generation in Section 4.5, outperforming concurrent work, in particular Zero123XL and SyncDreamer, in terms of multi-view consistency.

As discussed in Section 3.2, our video model is based on Stable Diffusion 2.1 (SD 2.1). Recent works show that it is crucial to adapt the noise schedule when training image diffusion models, shifting towards more noise for higher-resolution images. As a first step, we finetune the fixed discrete noise schedule from our image model towards continuous noise using the network preconditioning proposed in Karras et al. for images of size $256\times 384$. After inserting temporal layers, we then train the model on LVD-F on 14 frames at resolution $256\times 384$. We use the standard EDM noise schedule for 150k iterations and batch size 1536. Next, we finetune the model to generate 14 frames at $320\times 576$ for 100k iterations using batch size 768. We find that it is important to shift the noise schedule towards more noise for this training stage, confirming results by Hoogeboom et al. for image models. For further training details, see App. D. We refer to this model as our base model, which can be easily finetuned for a variety of tasks as we show in the following sections. The base model has learned a powerful motion representation; for example, it significantly outperforms all baselines for zero-shot text-to-video generation on UCF-101 (Tab. 2). Evaluation details can be found in App. E.
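To make the terms "network preconditioning" and "shifting the noise schedule" concrete, the sketch below writes out the preconditioning coefficients and the log-normal training distribution over noise levels from Karras et al. (EDM). The defaults $\sigma_\text{data}=0.5$, $P_\text{mean}=-1.2$ and $P_\text{std}=1.2$ are those of the EDM paper; the shifted value of $P_\text{mean}$ is a hypothetical choice for illustration and not necessarily the one used for our models.

```python
import math
import random
from statistics import median

SIGMA_DATA = 0.5  # data standard deviation assumed in the EDM formulation


def edm_preconditioning(sigma: float, sigma_data: float = SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for noise level sigma."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = 0.25 * math.log(sigma)
    return c_skip, c_out, c_in, c_noise


def sample_sigmas(n: int, p_mean: float, p_std: float, seed: int = 0):
    """Draw training noise levels with ln(sigma) ~ N(p_mean, p_std^2)."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(p_mean, p_std)) for _ in range(n)]


standard = sample_sigmas(10_000, p_mean=-1.2, p_std=1.2)  # EDM defaults
shifted = sample_sigmas(10_000, p_mean=0.5, p_std=1.2)    # hypothetical shift to more noise
print(f"median sigma: standard={median(standard):.2f}, shifted={median(shifted):.2f}")
```

Raising the mean of the log-normal moves most training samples to larger noise levels, which is what shifting the schedule towards more noise means in this parameterization.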

4.2 High-Resolution Text-to-Video Model

We finetune the base text-to-video model on a high-quality video dataset of $\sim$1M samples. Samples in the dataset generally contain lots of object motion, steady camera motion, and well-aligned captions, and are of high visual quality altogether. We finetune our base model for 50k iterations at resolution $576\times 1024$ (again shifting the noise schedule towards more noise) using batch size 768. Samples are shown in Figure 5; more can be found in App. E.

4.3 High-Resolution Image-to-Video Model

Besides text-to-video, we finetune our base model for image-to-video generation, where the video model receives a still input image as conditioning. Accordingly, we replace the text embeddings that are fed into the base model with the CLIP image embedding of the conditioning frame. Additionally, we concatenate a noise-augmented version of the conditioning frame channel-wise to the input of the UNet. We do not use any masking techniques and simply copy the frame across the time axis. We finetune two models, one predicting 14 frames and another one predicting 25 frames; implementation and training details can be found in App. D. We occasionally found that standard classifier-free guidance can lead to artifacts: too little guidance may result in inconsistency with the conditioning frame, while too much guidance can result in oversaturation. Instead of using a constant guidance scale, we found it helpful to linearly increase the guidance scale across the frame axis (from low to high). Details can be found in App. D. Samples are shown in Figure 5; more can be found in App. E.
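The frame-wise guidance schedule can be sketched as follows. The code applies the usual classifier-free guidance combination, but with one scale per frame that grows linearly along the time axis. The endpoint values and the toy list-based "predictions" are illustrative assumptions.

```python
from typing import List


def linear_guidance_scales(num_frames: int, w_min: float, w_max: float) -> List[float]:
    """Guidance scale per frame, increasing linearly from w_min to w_max."""
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if num_frames == 1:
        return [w_min]
    step = (w_max - w_min) / (num_frames - 1)
    return [w_min + i * step for i in range(num_frames)]


def guided_prediction(cond: List[List[float]], uncond: List[List[float]],
                      scales: List[float]) -> List[List[float]]:
    """Per-frame classifier-free guidance: uncond + w_t * (cond - uncond)."""
    return [[u + w * (c - u) for c, u in zip(c_frame, u_frame)]
            for c_frame, u_frame, w in zip(cond, uncond, scales)]


scales = linear_guidance_scales(num_frames=14, w_min=1.0, w_max=3.0)  # hypothetical endpoints
toy = guided_prediction(cond=[[1.0], [1.0]], uncond=[[0.0], [0.0]], scales=[1.0, 3.0])
print(scales[0], scales[-1], toy)
```

With this schedule the first frame stays close to the conditional prediction, while later frames receive progressively stronger guidance.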

In Figure 9, we compare our model with state-of-the-art, closed-source video generative models, in particular GEN-2 and PikaLabs, and show that our model is preferred in terms of visual quality by human voters. Details on the experiment, as well as many more image-to-video samples, can be found in App. E.

To facilitate controlled camera motion in image-to-video generation, we train a variety of camera motion LoRAs within the temporal attention blocks of our model; see App. D for exact implementation details. We train these additional parameters on a small dataset with rich camera-motion metadata. In particular, we use three subsets of the data for which the camera motion is categorized as “horizontally moving”, “zooming”, and “static”. In Figure 7, we show samples of the three models for identical conditioning frames; more samples can be found in App. E.
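As a reminder of what a LoRA adds to a layer, the following sketch implements a single linear projection with a frozen weight and a trainable low-rank update. It is a generic, dependency-free illustration of the LoRA idea with hypothetical names and initialization constants, not the implementation used in our temporal attention blocks.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                      # stays frozen during LoRA training
        self.scale = alpha / rank
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]  # zero init: no change at start

    def effective_weight(self) -> Matrix:
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def __call__(self, x: List[float]) -> List[float]:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
print(layer([2.0, 3.0]))  # identical to the frozen layer before any training
```

Because only A and B are trained, one small set of such matrices per camera motion type can be stored and swapped into the same base model.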

4.4 Frame Interpolation

To obtain smooth videos at high frame rates, we finetune our high-resolution text-to-video model into a frame interpolation model. We follow Blattmann et al. and concatenate the left and right frames to the input of the UNet via masking. The model learns to predict three frames between the two conditioning frames, effectively increasing the frame rate by a factor of four. Surprisingly, we found that a very small number of iterations ($\approx 10k$) suffices to get a good model. Details and samples can be found in App. D and App. E, respectively.
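The frame-count arithmetic behind this statement is sketched below: inserting three predicted frames into every gap between consecutive keyframes turns each original frame interval into four. The mask layout is a simplified assumption about how given and predicted frames could be marked, and the function names are ours.

```python
from typing import List


def interpolation_mask(new_per_gap: int = 3) -> List[int]:
    """1 marks a given conditioning frame, 0 a frame the model has to predict."""
    return [1] + [0] * new_per_gap + [1]


def interpolated_frame_count(num_keyframes: int, new_per_gap: int = 3) -> int:
    """Total frames after filling every gap between consecutive keyframes."""
    if num_keyframes < 2:
        raise ValueError("need at least two keyframes to interpolate")
    return num_keyframes + (num_keyframes - 1) * new_per_gap


def frame_rate_factor(new_per_gap: int = 3) -> int:
    """Each original frame interval is split into new_per_gap + 1 intervals."""
    return new_per_gap + 1


print(interpolation_mask(), interpolated_frame_count(14), frame_rate_factor())
```

For example, a 14-frame clip becomes a 53-frame clip covering the same time span.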

4.5 Multi-View Generation

To obtain multiple novel views of an object simultaneously, we finetune our image-to-video SVD model on multi-view datasets.

Datasets. We finetuned our SVD model on two datasets, where the SVD model takes a single image and outputs a sequence of multi-view images: (i) A subset of Objaverse consisting of 150K curated and CC-licensed synthetic 3D objects from the original dataset. For each object, we rendered $360^{\circ}$ orbital videos of 21 frames with a randomly sampled HDRI environment map and an elevation angle in $[-5^{\circ},30^{\circ}]$. We evaluate the resulting models on an unseen test dataset consisting of 50 sampled objects from the Google Scanned Objects (GSO) dataset. (ii) MVImgNet, consisting of casually captured multi-view videos of general household objects. We split the videos into $\sim$200K train and 900 test videos. We rotate the frames captured in portrait mode to landscape orientation.
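The orbital camera setup for the Objaverse renders can be illustrated with a few lines of geometry. The sketch below places 21 cameras on a full circle at a fixed elevation; the even azimuth spacing, the z-up convention, the orbit radius, and uniform sampling of the elevation are assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple


def sample_elevation(rng: random.Random, low: float = -5.0, high: float = 30.0) -> float:
    """Draw one elevation angle in degrees from the stated range (uniform assumed)."""
    return rng.uniform(low, high)


def orbit_camera_positions(num_frames: int = 21, elevation_deg: float = 10.0,
                           radius: float = 1.5) -> List[Tuple[float, float, float]]:
    """Camera centers on a 360-degree orbit around the origin (z axis pointing up)."""
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_frames):
        azim = 2.0 * math.pi * i / num_frames   # evenly spaced, last view stops before 360
        positions.append((radius * math.cos(elev) * math.cos(azim),
                          radius * math.cos(elev) * math.sin(azim),
                          radius * math.sin(elev)))
    return positions


elevation = sample_elevation(random.Random(0))
positions = orbit_camera_positions(elevation_deg=elevation)
print(len(positions), round(elevation, 2))
```

All cameras share the same height and distance to the object, so consecutive views differ only in azimuth, which is what makes such an orbit resemble a short video.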

The Objaverse-trained model is additionally conditioned on the elevation angle of the input image, and outputs orbital videos at that elevation angle. The MVImgNet-trained models are not conditioned on pose and can choose an arbitrary camera path in their generations. For details on the pose conditioning mechanism, see App. E.

Models. We refer to our finetuned multi-view model as SVD-MV. We perform an ablation study on the importance of the video prior of SVD for multi-view generation. To this end, we compare the results of SVD-MV, i.e. a model finetuned from a video prior, to those of a model finetuned from an image prior, i.e. the text-to-image model SD2.1 (SD2.1-MV), and a model trained without a prior, i.e. from random initialization (Scratch-MV). In addition, we compare with the current state-of-the-art multi-view generation models Zero123, Zero123XL, and SyncDreamer.

Metrics. We use the standard metrics of Peak Signal-to-Noise Ratio (PSNR), LPIPS, and CLIP Similarity scores (CLIP-S) between the corresponding pairs of ground truth and generated frames on 50 GSO test objects.

Training. We train all our models for 12k steps ($\sim$16 hours) on eight 80GB A100 GPUs using a total batch size of 16 and a learning rate of 1e-5.

Results. Figure 10(a) shows the average metrics on the GSO test dataset. The higher performance of SVD-MV compared to SD2.1-MV and Scratch-MV clearly demonstrates the advantage of the learned video prior in the SVD model for multi-view generation. In addition, as in the case of other models finetuned from SVD, we found that a very small number of iterations ($\approx 12k$) suffices to get a good model. Moreover, SVD-MV is competitive w.r.t. state-of-the-art techniques with less training time (12k iterations in 16 hours), whereas existing models are typically trained for much longer (for example, SyncDreamer was trained for four days specifically on Objaverse). Figure 10(b) shows the convergence of different finetuned models. After only 1k iterations, SVD-MV has much better CLIP-S and PSNR scores than its image-prior and no-prior counterparts.
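Of the three metrics, PSNR is the only one that does not require a pretrained network, so it is easy to state explicitly. The sketch below computes it for two frames given as flat lists of pixel values in [0, 1]; the flat-list representation is a simplification chosen to keep the example dependency-free.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0.0:
        return math.inf   # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


score = psnr([0.0, 0.0], [0.1, 0.1])
print(f"{score:.1f} dB")
```

Higher values indicate a generated frame that is closer to the ground truth; a uniform error of 0.1 on a [0, 1] scale corresponds to about 20 dB.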

Figure 8 shows a qualitative comparison of multi-view generation results on a GSO test object and Figure 11 on an MVImgNet test object. As can be seen, our generated frames are multi-view consistent and realistic. More details on the experiments, as well as more multi-view generation samples, can be found in App. E.

Conclusion

We present Stable Video Diffusion (SVD), a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video synthesis. To construct its pretraining dataset, we conduct a systematic data selection and scaling study, and propose a method to curate vast amounts of video data and turn large and noisy video collections into suitable datasets for generative video models. Furthermore, we introduce three distinct stages of video model training, which we separately analyze to assess their impact on the final model performance. Stable Video Diffusion provides a powerful video representation from which we finetune video models for state-of-the-art image-to-video synthesis and other highly relevant applications such as LoRAs for camera control. Finally, we provide a pioneering study on multi-view finetuning of video diffusion models and show that SVD constitutes a strong 3D prior, which obtains state-of-the-art results in multi-view synthesis while using only a fraction of the compute of previous methods.

We hope these findings will be broadly useful in the generative video modeling literature. A discussion on our work’s broader impact and limitations can be found in App. A.

Acknowledgements

Special thanks to Emad Mostaque for his excellent support on this project. Many thanks go to our colleagues Jonas Müller, Axel Sauer, Dustin Podell and Rahim Entezari for fruitful discussions and comments. Finally, we thank Harry Saini and the one and only Richard Vencu for maintaining and optimizing our data and computing infrastructure.

Appendix

Appendix A Broader Impact and Limitations

Broader Impact: Generative models for different modalities promise to revolutionize the landscape of media creation and use. While exploring their creative applications, reducing their potential for creating misinformation and harm is crucial before real-world deployment. Furthermore, risk analyses need to highlight and evaluate the differences between the various existing model types, such as interpolation, text-to-video, animation, and long-form generation. Before these models are used in practice, a thorough investigation of the models themselves, their intended uses, safety aspects, associated risks, and potential biases is essential.

Limitations: While our approach excels at short video generation, it comes with some fundamental shortcomings w.r.t. long video synthesis: although a latent approach provides efficiency benefits, generating multiple keyframes at once is expensive during both training and inference, and future work on long video synthesis should either try a cascade of very coarse frame generation or build dedicated tokenizers for video generation. Furthermore, videos generated with our approach sometimes suffer from too little generated motion. Lastly, video diffusion models are typically slow to sample and have high VRAM requirements, and our model is no exception. Diffusion distillation methods are promising candidates for faster synthesis.

Appendix B Related Work

Video Synthesis. Many approaches based on various models such as variational RNNs, normalizing flows, autoregressive transformers, and GANs have tackled video synthesis. Most of these works, however, have generated videos either at low resolution or on comparably small and noisy datasets which were originally proposed to train discriminative models.

Driven by increasing amounts of available compute resources and datasets better suited for generative modeling such as WebVid-10M, more competitive approaches have been proposed recently, mainly based on well-scalable, explicit likelihood-based approaches such as diffusion and autoregressive models. Motivated by a lack of available clean video data, all these approaches are leveraging joint image-video training and most methods are grounding their models on pretrained image models. Another commonality between these and most subsequent approaches to (text-to-)video synthesis is the usage of dedicated expert models to generate the actual visual content at a coarse frame rate and to temporally upscale this low-fps video to temporally smooth final outputs at 24-32 fps. Similar to the image domain, diffusion-based approaches can be mainly separated into cascaded approaches following and latent diffusion models translating the approach of Rombach et al. to the video domain. While most of these works aim at learning general motion representation and are consequently trained on large and diverse datasets, another well-recognized branch of diffusion-based video synthesis tackles personalized video generation based on finetuning of pretrained text-to-image models on more narrow datasets tailored to a specific domain or application, partly including non-deep motion priors. Finally, many recent works tackle the task of image-to-video synthesis, where the start frame is already given, and the model has to generate the consecutive frames. Importantly, as shown in our work (see Figure 1), when combined with off-the-shelf text-to-image models, image-to-video models can be used to obtain a full text-(to-image)-to-video pipeline.

Multi-View Generation. Motivated by their success in 2D image generation, diffusion models have also been used for multi-view generation. Early promising diffusion-based results have mainly been restricted by the limited availability of useful real-world multi-view training data. To address this, more recent works such as Zero-123, MVDream, and SyncDreamer propose techniques to adapt and finetune pretrained image generation models such as Stable Diffusion (SD) for multi-view generation, thereby leveraging image priors from SD. One issue with Zero-123 is that the generated multi-views can be inconsistent with respect to each other as they are generated independently with pose-conditioning. Some follow-up works try to address this view-consistency problem by jointly synthesizing the multi-view images. MVDream proposes to jointly generate four views of an object using a shared attention module across images. SyncDreamer proposes to estimate a 3D voxel structure in parallel to the multi-view image diffusion process to maintain consistency across the generated views.

Despite rapid progress in multi-view generation research, these approaches rely on single image generation models such as SD. We believe that our video generative model is a better candidate for the multi-view generation as multi-view images form a specific form of video where the camera is moving around an object. As a result, it is much easier to adapt a video-generative model for multi-view generation compared to adapting an image-generative model. In addition, the temporal attention layers in our video model naturally assist in the generation of consistent multi-views of an object without needing any explicit 3D structures like in .

Appendix C Data Processing

In this section, we provide more details about our processing pipeline, including its outputs on a few public video examples for demonstration purposes.

We start from a large collection of raw video data which is not useful for generative text-video (pre)training because of the following adverse properties: First, in contrast to discriminative approaches to video modeling, generative video models are sensitive to motion inconsistencies such as cuts, many of which are usually contained in raw and unprocessed video data, cf. Figure 2, left. Moreover, our initial data collection is biased towards still videos, as indicated by the peak at zero motion in Figure 2, right. Since generative models trained on this data would learn to generate videos containing cuts and still scenes, this emphasizes the need for cut detection and motion annotations to ensure temporal quality. Captions are another critical ingredient for training generative text-video models - ideally more than one per video - and they should be well aligned with the video content. The last essential component for generative video training which we consider here is the high visual quality of the training examples.

The design of our processing pipeline addresses the above points. Thus, to ensure temporal quality, we detect cuts with a cascaded approach directly after download, clip the videos accordingly, and estimate optical flow for each resulting video clip. After that, we apply three synthetic captioners to every clip and further extract frame-level CLIP similarities to all of these text prompts to be able to filter out outliers. Finally, visual quality at the frame level is assessed by using a CLIP-embeddings-based aesthetics score . We describe each step in more detail in what follows.

Similar to previous work, we use PySceneDetect (https://github.com/Breakthrough/PySceneDetect) to detect cuts in our base video clips. However, as qualitatively shown in Figure 12, we observe many fade-ins and fade-outs between consecutive scenes, which are not detected when running the cut detector at a single threshold and only at the native fps. Thus, in contrast to previous work, we apply a cascade of 3 cut detectors that operate at different frame rates and different thresholds to detect both sudden changes and slow ones such as fades.
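
To illustrate why a cascade helps, the following is a minimal sketch that does not use the PySceneDetect API. It assumes a hypothetical detector based on the mean absolute difference between frames that are `stride` frames apart; the strides and thresholds are illustrative values, not the settings used in our pipeline. A slow fade produces small differences at the native frame rate but large ones at a coarser rate.

```python
import numpy as np

def detect_cuts(frames, stride, threshold):
    """Hypothetical detector: flag frame i when the mean absolute difference
    between frame i and frame i - stride exceeds `threshold`."""
    frames = np.asarray(frames, dtype=np.float32)
    cuts = []
    for i in range(stride, len(frames), stride):
        diff = np.abs(frames[i] - frames[i - stride]).mean()
        if diff > threshold:
            cuts.append(i)
    return cuts

def cascade_cuts(frames, configs=((1, 30.0), (4, 25.0), (8, 40.0))):
    """Union of detections from several (stride, threshold) settings.
    The three settings are assumptions chosen for this toy example."""
    found = set()
    for stride, threshold in configs:
        found.update(detect_cuts(frames, stride, threshold))
    return sorted(found)

# Toy clip: 8 dark frames, an 8-frame linear fade, then 8 bright frames.
dark = [np.full((4, 4), 0.0)] * 8
fade = [np.full((4, 4), v) for v in np.linspace(0.0, 80.0, 10)[1:-1]]
bright = [np.full((4, 4), 80.0)] * 8
clip = dark + fade + bright

native_only = detect_cuts(clip, stride=1, threshold=30.0)  # misses the fade
with_cascade = cascade_cuts(clip)                          # finds it
```

In a real pipeline, detections that belong to the same fade would additionally be merged into a single cut region.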

We clip the videos using FFmpeg directly after cut detection by extracting the timestamps of the keyframes in the source videos and snapping detected cuts onto the closest keyframe timestamp that does not cross the detected cut. This allows us to quickly extract clips without cuts via seeking and avoids inserting new keyframes in each video, which would be prohibitively slow at scale.
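
The snapping step can be sketched as follows. This is one possible reading of the description above, under the assumption that "not crossing the cut" means snapping inward: a clip starts at the first keyframe at or after the preceding cut and ends at the last keyframe at or before the following cut. The function name and the timestamps are hypothetical.

```python
from bisect import bisect_left, bisect_right

def snap_clip_to_keyframes(keyframes, start_cut, end_cut):
    """Snap the cut-free segment [start_cut, end_cut] inward to keyframes.

    keyframes: sorted list of keyframe timestamps in seconds.
    Returns (start, end) with start >= start_cut and end <= end_cut,
    or None if fewer than two keyframes fall inside the segment.
    """
    i = bisect_left(keyframes, start_cut)      # first keyframe >= start_cut
    j = bisect_right(keyframes, end_cut) - 1   # last keyframe <= end_cut
    if i >= len(keyframes) or j <= i:
        return None
    return keyframes[i], keyframes[j]

keyframes = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
clip = snap_clip_to_keyframes(keyframes, start_cut=3.1, end_cut=8.7)
too_short = snap_clip_to_keyframes(keyframes, start_cut=4.5, end_cut=5.5)
```

Because both boundaries lie on existing keyframes, the clip can be extracted by seeking without re-encoding, at the cost of dropping the partial segments next to each cut.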

As motivated in Section 3.1 and Figure 2, it is crucial to provide means for filtering out static scenes. To enable this, we extract dense optical flow maps at 2fps using the OpenCV implementation of the Farnebäck algorithm. To keep storage size tractable, we spatially downscale the flow maps such that the shortest side is at 16px resolution. By averaging these maps over time and spatial coordinates, we further obtain a global motion score for each clip, which we use to filter out static scenes by using a threshold for the minimum required motion, which is chosen as detailed in Section E.2.2. Since this only yields a rough approximation, for the final Stage III finetuning we compute more accurate dense optical flow maps using RAFT at $800\times 450$ resolution. The motion scores are then computed similarly. Since the high-quality finetuning data is much smaller than the pretraining dataset, the RAFT-based flow computation remains tractable.
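
A minimal sketch of the global motion score, assuming the flow maps have already been computed and downscaled. It further assumes that the per-pixel flow magnitude is what gets averaged (averaging signed flow vectors would let opposite motions cancel); the threshold value is hypothetical.

```python
import numpy as np

def motion_score(flow_maps):
    """Average flow magnitude over time and spatial coordinates.

    flow_maps: array of shape (T, H, W, 2) holding per-pixel (dx, dy)
    displacements between consecutive sampled frames.
    """
    flow = np.asarray(flow_maps, dtype=np.float32)
    magnitude = np.linalg.norm(flow, axis=-1)  # shape (T, H, W)
    return float(magnitude.mean())

def has_enough_motion(flow_maps, min_motion):
    """Keep a clip only if its global motion score reaches the threshold."""
    return motion_score(flow_maps) >= min_motion

# Toy flow maps at 16px shortest side: one static clip, one moving clip.
static = np.zeros((6, 16, 28, 2))
moving = np.full((6, 16, 28, 2), 3.0)  # 3px shift in x and y everywhere

MIN_MOTION = 1.0  # hypothetical threshold, not the calibrated value
kept = [has_enough_motion(f, MIN_MOTION) for f in (static, moving)]
```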

At a million-sample scale, it is not feasible to hand-annotate data points with prompts. Hence, we resort to synthetic captioning to extract captions. However, in light of recent insights on the importance of caption diversity, and taking potential failure cases of these synthetic captioning models into consideration, we extract three captions per clip by using i) the image-only captioning model CoCa, which describes spatial aspects well, ii) the video-captioner VideoBLIP, to also capture temporal aspects, and iii) a lightweight LLM that combines these two captions and thereby overcomes potential flaws in each of them. Examples of the resulting captions are shown in Figure 14.

Extracting CLIP image and text representations has proven to be very helpful for data curation in the image domain, since computing the cosine similarity between the two allows for assessment of text-image alignment for a given example and thus for filtering out examples with erroneous captions. Moreover, it is possible to extract scores for visual aesthetics. Although CLIP is only able to process images, and this is consequently only possible at the single-frame level, we opt to extract both CLIP-based i) text-image similarities and ii) aesthetics scores of the first, center, and last frames of each video clip. As shown in Sections 3.3 and E.2.2, training text-video models on data curated by using these scores improves i) text following abilities and ii) visual quality of the generated samples compared to models trained on unfiltered data.
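
The text-image alignment score can be sketched as below, assuming the CLIP embeddings of the three frames and of the caption are already available as vectors. How the three per-frame similarities are aggregated is not specified above; taking their mean is an assumption made for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_alignment(frame_embeddings, text_embedding):
    """Mean similarity between a caption and the first, center, and last
    frame embeddings of a clip (mean aggregation is an assumption)."""
    sims = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    return float(np.mean(sims))

# Toy 2D embeddings: two frames match the caption, one is orthogonal to it.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
caption = [1.0, 0.0]
score = clip_alignment(frames, caption)
```

Clips whose score falls below a chosen threshold would then be treated as having an erroneous caption and filtered out.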

In early experiments, we noticed that models trained on earlier versions of LVD-F tended to generate videos depicting excessive amounts of written text, which is arguably not a desired feature for a text-to-video model. To address this, we applied the off-the-shelf text-detector CRAFT to annotate the start, middle, and end frames of each clip in our dataset with bounding box information on all written text. Using this information, we filtered out all clips with a total area of detected bounding boxes larger than 7% to construct the final LVD-F.
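
A minimal sketch of the text-area filter, assuming the detector output is available as axis-aligned pixel boxes. Two details are assumptions of this example rather than statements about our pipeline: the 7% is measured relative to the frame area, and a clip is dropped if any of its three annotated frames exceeds the limit.

```python
def text_area_fraction(boxes, frame_width, frame_height):
    """Fraction of the frame covered by text boxes.

    boxes: iterable of (x_min, y_min, x_max, y_max) in pixels.
    Box areas are summed, so overlapping boxes are counted twice.
    """
    total = sum(max(0, x1 - x0) * max(0, y1 - y0) for x0, y0, x1, y1 in boxes)
    return total / (frame_width * frame_height)

def keep_clip_by_text(per_frame_boxes, frame_width, frame_height, max_fraction=0.07):
    """Keep a clip only if every annotated frame stays within the limit."""
    return all(
        text_area_fraction(boxes, frame_width, frame_height) <= max_fraction
        for boxes in per_frame_boxes
    )

# Boxes for the start, middle, and end frame of two hypothetical clips.
little_text = [[(0, 0, 300, 100)], [], [(10, 10, 110, 60)]]
much_text = [[(0, 0, 500, 100)], [], []]
kept = [keep_clip_by_text(c, 1024, 576) for c in (little_text, much_text)]
```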

Appendix D Model and Implementation Details

where $\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma)$ is the score function . DM training reduces to learning a model ${\bm{s}}_{\bm{\theta}}({\mathbf{x}};\sigma)$ for the score function $\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma)$ . The model can, for example, be parameterized as $\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma)\approx s_{\bm{\theta}}({\mathbf{x}};\sigma)=(D_{\bm{\theta}}({\mathbf{x}};\sigma)-{\mathbf{x}})/\sigma^{2}$ , where $D_{\bm{\theta}}$ is a learnable denoiser that tries to predict the clean ${\mathbf{x}}_{0}$ . The denoiser $D_{\bm{\theta}}$ is trained via denoising score matching (DSM)

where $p(\sigma,{\mathbf{n}})=p(\sigma)\,{\mathcal{N}}\left({\mathbf{n}};\bm{0},\sigma^{2}\right)$ and $p(\sigma)$ can be a probability distribution or density over noise levels $\sigma$. It is possible to use either a discrete set or a continuous range of noise levels. In this work, we use both options, which we further specify in Section D.2.

where $F_{\bm{\theta}}$ is the network to be trained.
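
For orientation, the sketch below shows how a denoiser is assembled from a raw network in the EDM framework, $D(\mathbf{x};\sigma)=c_{\text{skip}}(\sigma)\,\mathbf{x}+c_{\text{out}}(\sigma)\,F(c_{\text{in}}(\sigma)\,\mathbf{x};c_{\text{noise}}(\sigma))$. The four coefficient functions used here are the defaults published with EDM (with a data standard deviation `sigma_data`) and are an assumption for illustration; they are not the preconditioning functions used for our models, and `network` is a hypothetical stand-in for $F_{\bm{\theta}}$.

```python
import numpy as np

def edm_denoiser(network, x, sigma, sigma_data=0.5):
    """Wrap a raw network F into a denoiser D using EDM-style preconditioning.

    D(x; sigma) = c_skip * x + c_out * F(c_in * x, c_noise)
    The coefficient choices below follow the EDM defaults (an assumption).
    """
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / np.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / np.sqrt(sigma**2 + sigma_data**2)
    c_noise = 0.25 * np.log(sigma)
    return c_skip * x + c_out * network(c_in * x, c_noise)

def zero_network(scaled_x, c_noise):
    """Toy stand-in for the trainable network: always predicts zeros."""
    return np.zeros_like(scaled_x)

x = np.ones((2, 3))
denoised = edm_denoiser(zero_network, x, sigma=0.5)
```

With these defaults, the skip connection dominates at low noise levels, so the network only has to predict a small correction there.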

Classifier-free guidance. Classifier-free guidance is a method used to guide the iterative refinement process of a DM towards a conditioning signal ${\mathbf{c}}$ . The main idea is to mix the predictions of a conditional and an unconditional model

where $w\geq 0$ is the guidance strength. The unconditional model can be trained jointly alongside the conditional model in a single network by randomly replacing the conditional signal ${\mathbf{c}}$ with a null embedding in Eq. 2, e.g., 10% of the time . In this work, we use classifier-free guidance, for example, to guide video generation toward text conditioning.
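
A minimal sketch of the mixing step, assuming the common convention in which the guided prediction is the unconditional prediction plus $w$ times the difference between the conditional and unconditional predictions. Under this convention, $w=0$ recovers the unconditional model and $w=1$ the conditional one. The toy arrays stand in for the two denoiser outputs.

```python
import numpy as np

def classifier_free_guidance(cond, uncond, w):
    """Mix conditional and unconditional predictions with strength w.

    Equivalent to w * cond - (w - 1) * uncond (one common convention,
    assumed here for illustration).
    """
    cond = np.asarray(cond, dtype=np.float64)
    uncond = np.asarray(uncond, dtype=np.float64)
    return uncond + w * (cond - uncond)

cond_pred = np.array([1.0, 2.0])
uncond_pred = np.array([0.0, 1.0])
guided = classifier_free_guidance(cond_pred, uncond_pred, w=2.0)
```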

D.2 Base Model Training and Architecture

As discussed in , we start from the publicly available Stable Diffusion 2.1 (SD 2.1) model. In the EDM-framework, SD 2.1 has the following preconditioning functions:

where $\sigma_{j+1}>\sigma_{j}$. The distribution over noise levels $p(\sigma)$ used for the original SD 2.1 training is a uniform distribution over the 1000 discrete noise levels $\{\sigma_{j}\}_{j\in[1000]}$. One issue with the training of SD 2.1 (and in particular its noise distribution $p(\sigma)$) is that even for the maximum discrete noise level $\sigma_{1000}$ the signal-to-noise ratio is still relatively high, which results in issues when, for example, generating very dark images. Guttenberg and CrossLabs proposed offset noise, a modification of the training objective in Eq. 2 that makes $p({\mathbf{n}}\mid\sigma)$ a non-isotropic Gaussian. In this work, we instead opt to modify the preconditioning functions and distribution over training noise levels altogether.

Image model finetuning. We replace the above preconditioning functions with

D.3 High-Resolution Text-to-Video Model

D.4 High-Resolution Image-to-Video Model

We occasionally found that vanilla classifier-free guidance (see Eq. 4) can lead to artifacts: too little guidance may result in inconsistency with the conditioning frame, while too much guidance can result in oversaturation. Instead of using a constant guidance scale, we found it helpful to linearly increase the guidance scale across the frame axis (from low to high). A PyTorch implementation of this novel technique can be found in Figure 16.
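
The idea can be sketched in NumPy as follows. This is not the code from Figure 16; it assumes the guided prediction is formed as the unconditional prediction plus a per-frame scale times the conditional difference, and the minimum and maximum scale values are hypothetical.

```python
import numpy as np

def linear_frame_guidance(cond, uncond, min_scale, max_scale):
    """Classifier-free guidance with a scale that grows linearly over frames.

    cond, uncond: arrays of shape (frames, ...) with the conditional and
    unconditional predictions for one video.
    """
    cond = np.asarray(cond, dtype=np.float64)
    uncond = np.asarray(uncond, dtype=np.float64)
    num_frames = cond.shape[0]
    scale = np.linspace(min_scale, max_scale, num_frames)
    # Reshape to (frames, 1, 1, ...) so the scale broadcasts over each frame.
    scale = scale.reshape((num_frames,) + (1,) * (cond.ndim - 1))
    return uncond + scale * (cond - uncond)

cond_pred = np.ones((5, 2, 2))
uncond_pred = np.zeros((5, 2, 2))
guided = linear_frame_guidance(cond_pred, uncond_pred, min_scale=1.0, max_scale=3.0)
```

The first frame, which is closest to the conditioning image, receives the weakest guidance, and the last frame the strongest.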

D.4.2 Camera Motion LoRA

To facilitate controlled camera motion in image-to-video generation, we train a variety of camera motion LoRAs within the temporal attention blocks of our model . In particular, we train low-rank matrices of rank 16 for 5k iterations. Additional samples can be found in Figure 21.

D.5 Interpolation Model Details

D.6 Multi-view generation

We finetuned the high-resolution image-to-video model on our specific rendering of the Objaverse dataset. We render 21 frames per orbit of an object in the dataset at $576\times 576$ resolution and finetune the 25-frame Image-to-Video model to generate these 21 frames. We feed one view of the object as the image condition. In addition, we feed the elevation of the camera as conditioning to the model. We first pass the elevation through a timestep embedding layer that embeds the sine and cosine of the elevation angle at various frequencies and concatenates them into a vector. This vector is finally concatenated to the overall vector condition of the UNet.
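
A minimal sketch of the elevation conditioning, assuming a standard sinusoidal embedding as commonly used for diffusion timesteps. The embedding dimension, the maximum period, the use of radians, and the size of the existing vector condition are assumptions chosen for this example.

```python
import numpy as np

def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Embed a scalar as cosines and sines at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = value * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

elevation = np.pi / 2                    # camera elevation in radians (assumed unit)
elevation_emb = sinusoidal_embedding(elevation)

vector_cond = np.zeros(3)                # hypothetical existing vector condition
full_cond = np.concatenate([vector_cond, elevation_emb])
```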

We trained for 12k iterations with a total batch size of 16 across 8 A100 GPUs of 80GB VRAM at a learning rate of $1\times 10^{-5}$.

Appendix E Experiment Details

For most of the evaluation conducted in this paper, we employ human evaluation as we observed it to contain the most reliable signal. For text-to-video tasks and all ablations conducted for the base model, we generate video samples from a list of 64 test prompts. We then employ human annotators to collect preference data on two axes: i) visual quality and ii) prompt following. More details on how the study was conducted (Section E.1.1) and how the rankings were computed (Section E.1.2) are given below.

Given all models in one ablation axis (e.g. four models of varying aesthetic or motion scores), we compare each prompt for each pair of models (1v1). For every such comparison, we collect on average three votes per task from different annotators, i.e., three each for visual quality and prompt following, respectively. Performing a complete assessment between all pair-wise comparisons gives us robust and reliable signals on model performance trends and the effect of varying thresholds. Sample interfaces that the annotators interact with are shown in Figure 17. The order of prompts and the order between models are fully randomized. Frequent attention checks are in place to ensure data quality.

E.1.2 Elo Score Calculation

To calculate rankings when comparing more than two models based on 1v1 comparisons as outlined in Section E.1.1, we use Elo Scores (higher-is-better) , which were originally proposed as a scoring method for chess players but have more recently also been applied to compare instruction-tuned generative LLMs . For a set of competing players with initial ratings $R_{\text{init}}$ participating in a series of zero-sum games, the Elo rating system updates the ratings of the two players involved in a particular game based on the expected and actual outcome of that game. Before the game with two players with ratings $R_{1}$ and $R_{2}$ , the expected outcome for the two players is calculated as

$$E_{i}=\frac{1}{1+10^{(R_{j}-R_{i})/400}},\qquad i,j\in\{1,2\},\ j\neq i.$$

After observing the result of the game, the ratings $R_{i}$ are updated via the rule

$$R_{i}\leftarrow R_{i}+K\cdot(S_{i}-E_{i}),\qquad E_{i}=\text{expected outcome for player }i,$$

where $S_{i}$ indicates the outcome of the match for player $i$. In our case, we have $S_{i}=1$ if player $i$ wins and $S_{i}=0$ if player $i$ loses. The constant $K$ can be seen as a weight emphasizing more recent games. We choose $K=1$ and bootstrap the final Elo ranking for a given series of comparisons based on 1000 individual Elo ranking calculations in a randomly shuffled order. Before comparing the models, we choose the start rating for every model as $R_{\text{init}}=1000$.
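
The procedure can be sketched as follows, using the standard Elo expected-score formula with a scale of 400 together with $K=1$ and $R_{\text{init}}=1000$ as stated above. Averaging the ratings over the shuffled runs is an assumption about how the bootstrap results are combined; the function names and the toy votes are hypothetical.

```python
import random

def expected_score(r_a, r_b):
    """Standard Elo expected outcome for a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def run_elo(games, players, k=1.0, r_init=1000.0):
    """Process (winner, loser) pairs in order and return the final ratings."""
    ratings = {p: r_init for p in players}
    for winner, loser in games:
        e_w = expected_score(ratings[winner], ratings[loser])
        e_l = 1.0 - e_w
        ratings[winner] += k * (1.0 - e_w)  # S = 1 for the winner
        ratings[loser] += k * (0.0 - e_l)   # S = 0 for the loser
    return ratings

def bootstrap_elo(games, players, rounds=1000, seed=0, **kwargs):
    """Average the ratings over `rounds` random orderings of the games."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(rounds):
        shuffled = list(games)
        rng.shuffle(shuffled)
        for p, r in run_elo(shuffled, players, **kwargs).items():
            totals[p] += r
    return {p: t / rounds for p, t in totals.items()}

# Toy votes: model A preferred over model B in three of four comparisons.
votes = [("A", "B")] * 3 + [("B", "A")]
final = bootstrap_elo(votes, players=["A", "B"])
```

Shuffling matters because Elo updates are order-dependent; averaging over many orderings removes the dependence on the arbitrary order in which votes were collected.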

E.2 Details on Experiments from Section 3

Architecturally, all models trained for the presented analysis in Section 3 are identical. To create a temporal UNet based on an existing spatial model, we follow Blattmann et al. and add temporal convolution and (cross-)attention layers after each corresponding spatial layer. As a base 2D-UNet, we use the architecture from Stable Diffusion 2.1, whose weights we further use to initialize the spatial layers for all runs except the second one presented in Figure 3(a), where we intentionally skip this initialization to create a baseline for demonstrating the effect of image-pretraining. Unlike Blattmann et al., we train all layers, including the spatial ones, and do not freeze the spatial layers after initialization. All models are trained with the AdamW optimizer with a learning rate of $1\times 10^{-4}$ and a batch size of $256$. Moreover, in contrast to our models from Section 4, we do not translate the noise process to continuous time but use the standard linear schedule used in Stable Diffusion 2.1, including offset noise, in combination with the v-parameterization. We omit the text-conditioning in 10% of the cases to enable classifier-free guidance during inference. To generate samples for the evaluations, we use 50 steps of the deterministic DDIM sampler with a classifier-free guidance scale of 12 for all models.

E.2.2 Calibrating Filtering Thresholds

Here, we present the outcomes of our study on filtering thresholds presented in Section 3.3. As stated there, we conduct experiments for the optimal filtering threshold for each type of annotation while not filtering for any other types. The only difference here is our assessment of the most suitable captioning method, where we simply compare all used captioning methods. We train each model on videos consisting of 8 frames at resolution $256\times 256$ for exactly 40k steps with a batch size of 256, roughly corresponding to 10M training examples seen during training. For evaluation, we create samples based on 64 pre-selected prompts for each model and conduct a human preference study as detailed in Section E.1. Figure 18 shows the ranking results of these human preference studies for each annotation axis for spatiotemporal sample quality and prompt following. Additionally, we show an averaged ‘aggregated’ score.

For captioning, we see that - surprisingly - the captions generated by the simple CLIP-based image captioning method CoCa of Yu et al. clearly have the most beneficial influence on the model. However, since recent research recommends using more than one caption per training example, we sample one of the three distinct captions during training. We nonetheless reflect the outcome of this experiment by shifting the captioning sampling distribution towards CoCa captions by using $p_{\text{CoCa}}=0.5;\,p_{\text{V-BLIP}}=0.25;\,p_{\text{LLM}}=0.25$.
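
The caption sampling can be sketched as below, using the three probabilities stated above. The dictionary keys and the example captions are hypothetical.

```python
import random

# Sampling probabilities for the three caption sources, as stated above.
CAPTION_PROBS = {"coca": 0.5, "vblip": 0.25, "llm": 0.25}

def sample_caption(captions, rng=random):
    """Pick one of a clip's three captions according to CAPTION_PROBS.

    captions: dict mapping each source name to that source's caption.
    rng: the `random` module or a `random.Random` instance.
    """
    sources = list(CAPTION_PROBS)
    weights = [CAPTION_PROBS[s] for s in sources]
    source = rng.choices(sources, weights=weights, k=1)[0]
    return captions[source]

clip_captions = {
    "coca": "a dog on a beach",
    "vblip": "a dog runs along the shore",
    "llm": "a dog running along a sandy beach",
}
rng = random.Random(0)
draws = [sample_caption(clip_captions, rng) for _ in range(10000)]
coca_fraction = draws.count(clip_captions["coca"]) / len(draws)
```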

For motion filtering, we choose to filter out the 25% most static examples. The aggregated preference score of the model trained with this filtering does not rank as high in human preference as that of the unfiltered model. However, the unfiltered model ranks best primarily because it ranks best in the category ‘prompt following’, which is less important than the ‘quality’ category when assessing the effect of motion filtering. Thus, we choose the 25% threshold, as mentioned above, since it achieves competitive performance in both ‘prompt following’ and ‘quality’.

For aesthetics filtering, where, as for motion thresholding, the ‘quality’ category is more important than the ‘prompt following’ category, we choose to filter out the 25% with the lowest aesthetics score, while for CLIP-score thresholding we omit even 50% since the model trained with the corresponding threshold is performing best. Finally, we filter out the 25% of samples with the largest text area covering the videos since it ranks highest both in the ‘quality’ category and on average.
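
Percentage-based filtering of this kind can be sketched as follows. The helper name and the toy scores are hypothetical; for annotations where large values are undesirable, such as text area, the scores can be negated before calling the helper.

```python
def drop_lowest_fraction(scores, fraction):
    """Return the indices kept after dropping the `fraction` of examples
    with the lowest score (ties are broken by original position)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_drop = int(len(scores) * fraction)
    return sorted(order[n_drop:])

aesthetics = [0.1, 0.9, 0.5, 0.3]
text_area = [0.00, 0.20, 0.01, 0.05]

keep_aesthetic = drop_lowest_fraction(aesthetics, 0.25)
# Drop the 25% with the LARGEST text area by negating the scores.
keep_text = drop_lowest_fraction([-a for a in text_area], 0.25)
# A clip survives only if it passes every filter.
kept = sorted(set(keep_aesthetic) & set(keep_text))
```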

Using these filtering methods, we reduce the size of LVD by more than a factor of 3, cf. Tab. 1, but obtain a much cleaner dataset as shown in Section 3. For the remaining experiments in Section 3.3, we use the identical architecture and hyperparameters as stated above. We only vary the dataset as detailed in Section 3.3.

E.2.3 Finetuning Experiments

For the finetuning experiments shown in Section 3.4, we again follow the architecture, training hyperparameters, and sampling procedure stated at the beginning of this section. The only notable differences are the exchange of the dataset and the increase in resolution from the pretraining resolution $256\times 256$ to $512\times 512$ while still generating videos consisting of 8 frames. We train all models presented in this section for 50k steps.

E.3 Human Eval vs SOTA

For comparison of our image-to-video model with state-of-the-art models like Gen-2 and Pika, we randomly choose 64 conditioning images generated from a $1024\times 576$ finetune of SDXL. We employ the same framework as in Section E.1.1 to evaluate and compare the visual quality of the generated samples with other models.

For Gen-2, we sample the image-to-video model from the web UI. We used a fixed seed of 23, the default motion value of 5 (on a scale of 10), and turned on the “Interpolate” and “Remove watermark” features. This results in 4-second samples at $1408\times 768$. We then resize the shorter side to yield $1056\times 576$ and perform a center-crop to match our resolution of $1024\times 576$. For our model, we sample our 25-frame image-to-video finetune to give 28 frames and also interpolate using our interpolation model to yield samples of 3.89 seconds at 28 FPS. We crop the Gen-2 samples to 3.89 seconds to avoid biasing the annotators.
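
The resize and center-crop geometry for the Gen-2 samples can be checked with a short sketch; the helper names are hypothetical and only the box arithmetic is shown, not the actual image resampling.

```python
def resize_shorter_side(width, height, target):
    """Scale (width, height) so that the shorter side equals `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

def center_crop_box(width, height, crop_width, crop_height):
    """Return the (left, top, right, bottom) box of a centered crop."""
    left = (width - crop_width) // 2
    top = (height - crop_height) // 2
    return left, top, left + crop_width, top + crop_height

resized = resize_shorter_side(1408, 768, target=576)
crop_box = center_crop_box(*resized, crop_width=1024, crop_height=576)
```

The crop removes 16 pixels on the left and right of the resized frame and leaves the height untouched.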

For Pika, we sample the image-to-video model from the Discord bot. We used a fixed seed of 23, the motion value of 2 (on a scale of 0-4), and specified a 16:9 aspect ratio. This results in 3-second samples at $1024\times 576$, which matches our resolution. For our model, we sample our 25-frame image-to-video finetune to give 28 frames and also interpolate using our interpolation model to yield samples of 3.89 seconds at 28 FPS. We crop our samples to 3 seconds to match Pika and avoid biasing the annotators. Since Pika samples have a small “Pika Labs” watermark in the bottom right, we pad that region with black pixels for both Pika and our samples to also avoid bias.

E.4 UCF101 FVD

This section describes the zero-shot UCF101 FVD computation of our base text-to-video model. The UCF101 dataset consists of 13,320 video clips, which are classified into 101 action categories. All videos have a frame rate of 25 FPS and a resolution of $240\times 320$. To compute FVD, we generate 13,320 videos (16 frames at 25 FPS, classifier-free guidance with scale $w=7$) using the same distribution of action categories, that is, for example, 140 videos of “TableTennisShot”, 105 videos of “PlayingPiano”, etc. We condition the model directly on the action category (“TableTennisShot”, “PlayingPiano”, etc.) and do not use any text modification. Our samples are generated at our model’s native resolution $320\times 576$ (16 frames), and we downsample to $240\times 432$ using bilinear interpolation with antialiasing, followed by a center crop to $240\times 320$. We extract features using a pretrained I3D action classification model; in particular, we use a TorchScript model (https://www.dropbox.com/s/ge9e5ujwgetktms/i3d_torchscript.pt, called with keyword arguments rescale=True, resize=True, return_features=True) provided by Brooks et al.
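
Given the extracted features, the final distance can be sketched as the standard Fréchet distance between two Gaussians fitted to the real and generated feature sets. This is a NumPy-only sketch under the assumption that feature matrices of shape (videos, feature dimension) are already available; it is not the evaluation code used for the reported numbers, and the random features are toy data.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    d^2 = |mu_a - mu_b|^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))
    """
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    # The trace of the matrix square root equals the sum of the square
    # roots of the eigenvalues of C_a C_b, which are real and non-negative
    # up to numerical error for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))                 # toy stand-in for real features
shifted = real + np.array([1.0, 2.0, 0.0, 0.0])  # same covariance, shifted mean
same = frechet_distance(real, real)
moved = frechet_distance(real, shifted)
```

For two sets with identical covariance, the distance reduces to the squared distance between the means, which is what the shifted toy example exercises.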

E.5 Additional Samples

Here, we show additional samples for the models introduced in Sections D.2, 4.2, 4.3 and 4.5.

In Figure 19, we show additional samples from our text-to-video model introduced in Section 4.2.

E.5.2 Additional Image-to-Video Samples

In Figure 20, we show additional samples from our image-to-video model introduced in Section 4.3.

E.5.3 Additional Camera Motion LoRA Samples

In Figure 21, we show additional samples for our motion LoRAs tuned for camera control as presented in Section 4.3.1.

E.5.4 Temporal Prompting via Temporal Cross-Attention Layers

Our architecture follows Blattmann et al., who introduced dedicated temporal cross-attention layers, which are interleaved with the spatial cross-attention layers of the standard 2D-UNet. While probing our Text-to-Video model from Section 4.2, we noticed that it is possible to independently prompt the model spatially and temporally by using different text-prompts as inputs for the spatial and temporal cross-attention conditionings, see Figure 22. To achieve this, we use a dedicated spatial prompt to describe the general content of the scene to be depicted, while the motion of that scene is fed to the model via a separate temporal prompt, which is the input to the temporal cross-attention layers. We provide an example of these first experiments indicating this implicit disentanglement of motion and content in Figure 22, where we show that varying the temporal prompt while fixing the random seed and spatial prompt leads to spatially similar scenes whose global motion properties follow the temporal prompt.

E.5.5 Additional Samples on Multi-View Synthesis

In Figures 23, 24, 25 and 26, we show additional visual examples for SVD-MV, trained on our renderings of Objaverse and MVImageNet datasets as described in Section 4.5.