Structure and Content-Guided Video Synthesis with Diffusion Models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis

Introduction

Visual effects and video editing are ubiquitous in the modern media landscape. As video-centric platforms have grown in popularity, demand for more intuitive and performant video editing tools has increased. However, editing in this format remains complex and time-consuming due to the temporal nature of video data. State-of-the-art machine learning models have shown great promise in improving the editing process, but existing methods often trade temporal consistency against spatial detail.

Generative approaches for image synthesis recently experienced a rapid surge in quality and popularity due to the introduction of powerful diffusion models trained on large-scale datasets. Text-conditioned models, such as DALL-E 2 and Stable Diffusion, enable novice users to generate detailed imagery given only a text prompt as input. Latent diffusion models especially offer efficient methods for producing imagery via synthesis in a perceptually compressed space.

Motivated by the progress of diffusion models in image synthesis, we investigate generative models suited for interactive applications in video editing. Current methods repurpose existing image models by either propagating edits with approaches that compute explicit correspondences or by finetuning on each individual video. We aim to circumvent expensive per-video training and correspondence calculation to achieve fast inference for arbitrary videos.

We propose a controllable structure and content-aware video diffusion model trained on a large-scale dataset of uncaptioned videos and paired text-image data. We opt to represent structure with monocular depth estimates and content with embeddings predicted by a pre-trained neural network. Our approach offers several powerful modes of control in its generative process. First, similar to image synthesis models, we train our model such that the content of inferred videos, e.g. their appearance or style, matches user-provided images or text prompts (Fig. 1). Second, inspired by the diffusion process, we apply an information-obscuring process to the structure representation so that users can select how strongly the model adheres to the given structure. Finally, we also adjust the inference process via a custom guidance method, inspired by classifier-free guidance, to enable control over temporal consistency in generated clips.

In summary, we present the following contributions:

We extend latent diffusion models to video generation by introducing temporal layers into a pre-trained image model and training jointly on images and videos.

We present a structure and content-aware model that modifies videos guided by example images or texts. Editing is performed entirely at inference time without additional per-video training or pre-processing.

We demonstrate full control over temporal, content and structure consistency. We show for the first time that jointly training on image and video data enables inference-time control over temporal consistency. For structure consistency, training on varying levels of detail in the representation allows choosing the desired setting during inference.

We show that our approach is preferred over several other approaches in a user study.

We demonstrate that the trained model can be further customized to generate more accurate videos of a specific subject by finetuning on a small set of images.

Related Work

Controllable video editing and media synthesis is an active area of research. In this section, we review prior work in related areas and connect our method to these approaches.

Unconditional video generation. Generative adversarial networks (GANs) can learn to synthesize videos based on specific training data. These methods often struggle with stability during optimization, and produce fixed-length videos or longer videos where artifacts accumulate over time. More recent work synthesizes longer videos at high detail with a custom positional encoding and an adversarially-trained model leveraging the encoding, but training is still restricted to small-scale datasets. Autoregressive transformers have also been proposed for unconditional video generation. However, our focus is on providing user control over the synthesis process whereas these approaches are limited to sampling random content resembling their training distribution.

Diffusion models for image synthesis. Diffusion models (DMs) have recently attracted the attention of researchers and artists alike due to their ability to synthesize detailed imagery, and are now being applied to other areas of content creation such as motion synthesis and 3D shape generation.

Other works improve image-space diffusion by changing the parameterization, introducing advanced sampling methods, designing more powerful architectures, or conditioning on additional information. Text-conditioning, based on embeddings from CLIP or T5, has become a particularly powerful approach for providing artistic control over model output. Latent diffusion models (LDMs) perform diffusion in a compressed latent space, reducing memory requirements and runtime. We extend LDMs to the spatio-temporal domain by introducing temporal connections into the architecture and by training jointly on video and image data.

Diffusion models for video synthesis. Recently, diffusion models, masked generative models and autoregressive models have been applied to text-conditioned video synthesis. Similar to prior work, we extend image synthesis diffusion models to video generation by introducing temporal connections into a pre-existing image model. However, rather than synthesizing videos, including their structure and dynamics, from scratch, we aim to provide editing abilities on existing videos. While the inference process of diffusion models enables editing to some degree, we demonstrate that our model with explicit conditioning on structure is significantly preferred.

Video translation and propagation. Image-to-image translation models, such as pix2pix, can process each individual frame in a video, but this produces inconsistency between frames as the model lacks awareness of the temporal neighborhood. Accounting for temporal or geometric information, such as flow, in a video can increase consistency across frames when repurposing image synthesis models. We can extract such structural information to aid our spatio-temporal LDM in text- and image-guided video synthesis. Many generative adversarial methods, such as vid2vid, leverage this type of input to guide synthesis combined with architectures specifically designed for spatio-temporal generation. However, similar to GAN-based approaches for images, results have been mostly limited to singular domains.

Video style transfer takes a reference style image and statistically applies its style to an input video. In comparison, our method applies a mix of style and content from an input text prompt or image while being constrained by the extracted structure data. By learning a generative model from data, our approach produces semantically consistent outputs instead of matching feature statistics.

Text2Live allows editing input videos using text prompts by decomposing a video into neural layers. Once available, a layered video representation provides consistent propagation across frames. SinFusion can generate variations and extrapolations of videos by optimizing a diffusion model on a single video. Similarly, Tune-a-Video finetunes an image model converted to video generation on a single video to enable editing. However, expensive per-video training limits the practicality of these approaches in creative tools. We opt to instead train our model on a large-scale dataset, permitting inference on any video without individual training.

Method

For our purposes, it will be helpful to think of a video in terms of its content and structure. By structure, we refer to characteristics describing its geometry and dynamics, e.g. shapes and locations of subjects as well as their temporal changes. We define content as features describing the appearance and semantics of the video, such as the colors and styles of objects and the lighting of the scene. The goal of our model is then to edit the content of a video while retaining its structure.

To achieve this, we aim to learn a generative model $p(x|s,c)$ of videos $x$, conditioned on representations of structure, denoted by $s$, and content, denoted by $c$. We infer the structure representation $s$ from an input video, and modify it based on a text prompt $c$ describing the edit. First, we describe our realization of the generative model as a conditional latent video diffusion model and, then, we describe our choices for structure and content representations. Finally, we discuss the optimization process of our model. See Fig. 2 for an overview.

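Diffusion models learn to reverse a fixed forward diffusion process, which is defined as

$$q(x_{t}|x_{t-1})\coloneqq\mathcal{N}\big(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathcal{I}\big).\tag{1}$$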

Normally-distributed noise is slowly added to each sample $x_{t-1}$ to obtain $x_{t}$. The forward process models a fixed Markov chain and the noise is dependent on a variance schedule $\beta_{t}$ where $t\in\{1,\dots,T\}$, with $T$ being the total number of steps in our diffusion chain, and $x_{0}\coloneqq x$.
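To make the forward process concrete, the following is a minimal NumPy sketch of a Gaussian forward step of the kind used by DDPM-style models, assuming each step scales the previous sample by $\sqrt{1-\beta_{t}}$ and adds noise with variance $\beta_{t}$. The linear schedule and all function names are illustrative assumptions, not choices stated in this paper.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule (an assumption, not from the paper)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """One Markov step: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_jump(x0, t, betas, rng):
    """Sample x_t directly from x_0 via the closed form
    x_t ~ N(sqrt(abar_t) * x_0, (1 - abar_t) * I), abar_t = prod_{s<=t} (1 - beta_s)."""
    abar_t = np.prod(1.0 - betas[:t])
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule(T=1000)
x0 = rng.standard_normal((4, 8, 8))  # stand-in for a data sample

# Run the full chain step by step; the result is close to pure Gaussian noise.
x = x0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)
```

Under this assumed schedule the cumulative signal coefficient after all $T$ steps is close to zero, which is why the end of the chain can be treated as a standard normal sample.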

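Learning to Denoise. The reverse process is defined according to the following equation with parameters $\theta$

$$p_{\theta}(x_{t-1}|x_{t})\coloneqq\mathcal{N}\big(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)\big).\tag{2}$$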

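Using a fixed variance $\Sigma_{\theta}(x_{t},t)$, we are left to learn the means of the reverse process $\mu_{\theta}(x_{t},t)$. Training is typically performed via a reweighted variational bound on the maximum likelihood objective, resulting in a loss with timestep-dependent weights $\lambda_{t}$

$$L\coloneqq\mathbb{E}_{t,q}\,\lambda_{t}\,\big\|\mu_{t}(x_{t},x_{0})-\mu_{\theta}(x_{t},t)\big\|^{2},\tag{3}$$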

where $\mu_{t}(x_{t},x_{0})$ is the mean of the forward process posterior $q(x_{t-1}|x_{t},x_{0})$, which is available in closed form.

Parameterization. The mean $\mu_{\theta}(x_{t},t)$ is then predicted by a UNet architecture that receives the noisy input $x_{t}$ and the diffusion timestep $t$ as inputs. Instead of directly predicting the mean, different combinations of parameterizations and weightings, such as $x_{0}$-, $\epsilon$- and $v$-parameterizations, have been proposed, which can have significant effects on sample quality. In early experiments, we found it beneficial to use $v$-parameterization to improve color consistency of video samples, consistent with prior findings, and therefore we use it for all experiments.
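As an illustration of how the parameterizations relate, the sketch below assumes the common definition of the $v$-target, $v=\alpha_{t}\epsilon-\sigma_{t}x_{0}$, for a noisy sample $x_{t}=\alpha_{t}x_{0}+\sigma_{t}\epsilon$ with $\alpha_{t}^{2}+\sigma_{t}^{2}=1$. The paper does not spell out its exact definition, so this convention and the function names are assumptions.

```python
import numpy as np

def to_v(x0, eps, alpha_t, sigma_t):
    """v-target for a variance-preserving process (assumes alpha_t^2 + sigma_t^2 = 1)."""
    return alpha_t * eps - sigma_t * x0

def from_v(x_t, v, alpha_t, sigma_t):
    """Recover the x0- and eps-predictions implied by a v-prediction."""
    x0 = alpha_t * x_t - sigma_t * v
    eps = sigma_t * x_t + alpha_t * v
    return x0, eps

rng = np.random.default_rng(0)
x0_true = rng.standard_normal((2, 3))
eps_true = rng.standard_normal((2, 3))
alpha_t, sigma_t = np.cos(0.3), np.sin(0.3)  # any pair on the unit circle

x_t = alpha_t * x0_true + sigma_t * eps_true
v = to_v(x0_true, eps_true, alpha_t, sigma_t)
x0_rec, eps_rec = from_v(x_t, v, alpha_t, sigma_t)
```

Because the three targets are linear functions of one another given $x_{t}$, switching parameterization changes the loss weighting across timesteps rather than the information the network must predict.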

Latent diffusion. Latent diffusion models (LDMs) take the diffusion process into the latent space. This provides an improved separation between compressive and generative learning phases of the model. Specifically, LDMs use an autoencoder where an encoder $\mathcal{E}$ maps input data $x$ to a lower dimensional latent code according to $z=\mathcal{E}(x)$ while a decoder $\mathcal{D}$ converts latent codes back to the input space such that perceptually $x\approx\mathcal{D}(\mathcal{E}(x))$.

3.2 Spatio-temporal Latent Diffusion

To correctly model a distribution over video frames, the architecture must take relationships between frames into account. At the same time, we want to jointly learn an image model with shared parameters to benefit from better generalization obtained by training on large-scale image datasets.

To achieve this, we extend an image architecture by introducing temporal layers, which are only active for video inputs. All other layers are shared between the image and video model. The autoencoder remains fixed and processes each frame in a video independently.

The UNet consists of two main building blocks: residual blocks and transformer blocks (see Fig. 3). Similar to prior work, we extend them to videos by adding both 1D convolutions across time and 1D self-attentions across time. In each residual block, we introduce one temporal convolution after each 2D convolution. Similarly, after each spatial 2D transformer block, we also include one temporal 1D transformer block, which mimics its spatial counterpart along the time axis. We also input learnable positional encodings of the frame index into temporal transformer blocks.

In our implementation, we consider images as videos with a single frame to treat both cases uniformly. A batched tensor with batch size $b$, number of frames $n$, $c$ channels, and spatial resolution $w\times h$ (i.e. shape $b\times n\times c\times h\times w$) is rearranged to $(b\cdot n)\times c\times h\times w$ for spatial layers, to $(b\cdot h\cdot w)\times c\times n$ for temporal convolutions, and to $(b\cdot h\cdot w)\times n\times c$ for temporal self-attention.
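The rearrangements described above can be sketched in NumPy as follows. The helper names are hypothetical and the axis permutations are simply one choice consistent with the stated shapes; an actual implementation would likely use the equivalent operations of its deep learning framework.

```python
import numpy as np

def to_spatial(x):
    """(b, n, c, h, w) -> (b*n, c, h, w): every frame becomes its own batch item."""
    b, n, c, h, w = x.shape
    return x.reshape(b * n, c, h, w)

def to_temporal_conv(x):
    """(b, n, c, h, w) -> (b*h*w, c, n): one 1D sequence per spatial location."""
    b, n, c, h, w = x.shape
    return x.transpose(0, 3, 4, 2, 1).reshape(b * h * w, c, n)

def to_temporal_attn(x):
    """(b, n, c, h, w) -> (b*h*w, n, c): n tokens of width c per spatial location."""
    b, n, c, h, w = x.shape
    return x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, n, c)

def from_temporal_conv(y, shape):
    """Inverse of to_temporal_conv for the original shape (b, n, c, h, w)."""
    b, n, c, h, w = shape
    return y.reshape(b, h, w, c, n).transpose(0, 4, 3, 1, 2)

b, n, c, h, w = 2, 8, 4, 6, 5
x = np.arange(b * n * c * h * w, dtype=np.float32).reshape(b, n, c, h, w)

spatial = to_spatial(x)
temporal_conv = to_temporal_conv(x)
temporal_attn = to_temporal_attn(x)
restored = from_temporal_conv(temporal_conv, x.shape)
```

With $n=1$ the temporal views contain a single time step per location, which is how an image can be treated as a one-frame video by the same code path.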

3.3 Representing Content and Structure

Conditional Diffusion Models. Diffusion models are well-suited to modeling conditional distributions such as $p(x|s,c)$. In this case, the forward process $q$ remains unchanged while the conditioning variables $s,c$ become additional inputs to the model.

We limit ourselves to uncaptioned video data for training due to the lack of large-scale paired video-text datasets similar in quality to existing image datasets. Thus, while our goal is to edit an input video based on a text prompt describing the desired edited video, we have neither training data of triplets with a video, its edit prompt and the resulting output, nor even pairs of videos and text captions.

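Therefore, during training, we must derive structure and content representations from the training video $x$ itself, i.e. $s=s(x)$ and $c=c(x)$, resulting in a per-example loss, with timestep-dependent weight $\lambda_{t}$, of

$$\lambda_{t}\,\big\|\mu_{t}(\mathcal{E}(x)_{t},\mathcal{E}(x)_{0})-\mu_{\theta}(\mathcal{E}(x)_{t},t,s(x),c(x))\big\|^{2}.\tag{4}$$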

In contrast, during inference, structure $s$ and content $c$ are derived from an input video $y$ and from a text prompt $t$, respectively. An edited version $x$ of $y$ is obtained by sampling the generative model conditioned on $s(y)$ and $c(t)$:

$$z\sim p_{\theta}(z|s(y),c(t)),\qquad x=\mathcal{D}(z).\tag{5}$$

Content Representation. To infer a content representation from both text inputs $t$ and video inputs $x$, we follow previous works and utilize CLIP image embeddings to represent content. For video inputs, we select one of the input frames randomly during training. Similar to prior work, one can then train a prior model that allows sampling image embeddings from text embeddings. This approach enables specifying edits through image inputs instead of just text.

Decoder visualizations demonstrate that CLIP embeddings have increased sensitivity to semantic and stylistic properties while being more invariant towards precise geometric attributes, such as sizes and locations of objects. Thus, CLIP embeddings are a fitting representation for content as structure properties remain largely orthogonal.

Structure Representation. A perfect separation of content and structure is difficult. Prior knowledge about semantic object classes in videos influences the probability of certain shapes appearing in a video. Nevertheless, we can choose suitable representations to introduce inductive biases that guide our model towards the intended behavior while decreasing correlations between structure and content.

We find that depth estimates extracted from input video frames provide the desired properties as they encode significantly less content information compared to simpler structure representations. For example, edge filters also detect textures in a video, which limits the range of artistic control over content in videos. Still, a fundamental overlap between content and structure information remains with our choice of CLIP image embeddings as a content representation and depth estimates as a structure representation. Depth maps reveal the silhouettes of objects, which prevents content edits involving large changes in object shape.

To provide more control over the amount of structure to preserve, we propose to train a model on structure representations with varying amounts of information. We employ an information-destroying process based on a blur operator, which improves stability compared to other approaches such as adding noise. Similar to the diffusion timestep $t$, we provide the structure blurring level $t_{s}$ as an input to the model. We note that blurring has also been explored as a forward process for generative modeling.

While depth maps work well for our use case, our approach generalizes to other geometric guidance features or combinations of features that might be more helpful for other specific applications. For example, models focusing on human video synthesis might benefit from estimated poses or face landmarks.

Conditioning Mechanisms. We account for the different characteristics of our content and structure with two different conditioning mechanisms. Since structure represents a significant portion of the spatial information of video frames, we use concatenation for conditioning to make effective use of this information. In contrast, attributes described by the content representation are not tied to particular locations. Hence, we leverage cross-attention which can effectively transport this information to any position.

We use the spatial transformer blocks of the UNet architecture for cross-attention conditioning. Each contains two attention operations, where the first one performs a spatial self-attention and the second one a cross-attention with keys and values computed from the CLIP image embedding.

To condition on structure, we first estimate depth maps for all input frames using the MiDaS DPT-Large model. We then apply $t_{s}$ iterations of blurring and downsampling to the depth maps, where $t_{s}$ controls the amount of structure to preserve from the input video. During training, we randomly sample $t_{s}$ between $0$ and $T_{s}$. At inference, this parameter can be controlled to achieve different editing effects (see Fig. 10). We resample the perturbed depth map to the resolution of the RGB frames and encode it using $\mathcal{E}$. This latent representation of structure is concatenated with the input $z_{t}$ given to the UNet. We also input four channels containing a sinusoidal embedding of $t_{s}$.
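The structure degradation can be illustrated with a short NumPy sketch. The text does not specify the blur kernel, the downsampling factor, or the resampling method, so the 3x3 box blur, the factor of two, and the nearest-neighbour resampling below are all assumptions, as are the function names.

```python
import numpy as np

def box_blur(depth):
    """3x3 box blur with edge padding (hypothetical kernel choice)."""
    h, w = depth.shape
    padded = np.pad(depth, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def degrade_structure(depth, t_s):
    """Apply t_s rounds of blur + 2x downsampling, then resample to the input size."""
    h, w = depth.shape
    out = depth.astype(np.float64)
    for _ in range(t_s):
        out = box_blur(out)[::2, ::2]
    # Nearest-neighbour resampling back to the original (h, w) grid.
    rows = np.arange(h) * out.shape[0] // h
    cols = np.arange(w) * out.shape[1] // w
    return out[np.ix_(rows, cols)]

depth = np.random.default_rng(0).random((32, 48))  # stand-in for a depth map
unchanged = degrade_structure(depth, t_s=0)
coarse = degrade_structure(depth, t_s=3)
```

With `t_s=0` the depth map passes through untouched, while larger values progressively remove fine detail and leave only coarse layout, which is the behaviour that makes $t_{s}$ usable as a structure-adherence control.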

Sampling. While Eq. (2) provides a direct way to sample from the trained model, many other sampling methods require only a fraction of the number of diffusion timesteps to achieve good sample quality. We use DDIM throughout our experiments. Furthermore, classifier-free diffusion guidance significantly improves sample quality. For a conditional model $\mu_{\theta}(x_{t},t,c)$, this is achieved by training the model to also perform unconditional predictions $\mu_{\theta}(x_{t},t,\emptyset)$ and then adjusting predictions during sampling according to

$$\tilde{\mu}_{\theta}(x_{t},t,c)=\mu_{\theta}(x_{t},t,\emptyset)+\omega\big(\mu_{\theta}(x_{t},t,c)-\mu_{\theta}(x_{t},t,\emptyset)\big),\tag{6}$$

where $\omega$ is the guidance scale that controls the strength. Based on the intuition that $\omega$ extrapolates the direction between an unconditional and a conditional model, we apply this idea to control temporal consistency of our model. Specifically, since we are training both an image and a video model with shared parameters, we can consider predictions by both models for the same input. Let $\mu_{\theta}(z_{t},t,c,s)$ denote the prediction of our video model, and let $\mu^{\pi}_{\theta}(z_{t},t,c,s)$ denote the prediction of the image model applied to each frame individually. Taking classifier-free guidance for $c$ into account, we then adjust our prediction with a temporal guidance scale $\omega_{t}$ according to

$$\tilde{\mu}_{\theta}(z_{t},t,c,s)=\mu^{\pi}_{\theta}(z_{t},t,\emptyset,s)+\omega_{t}\big(\mu_{\theta}(z_{t},t,\emptyset,s)-\mu^{\pi}_{\theta}(z_{t},t,\emptyset,s)\big)+\omega\big(\mu_{\theta}(z_{t},t,c,s)-\mu_{\theta}(z_{t},t,\emptyset,s)\big).\tag{7}$$

Our experiments demonstrate that this approach controls temporal consistency in the outputs, see Fig. 4.
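A minimal sketch of how such a two-scale guidance rule could be combined at sampling time is given below. It assumes that the temporal term extrapolates from the per-frame image prediction towards the video prediction with scale $\omega_{t}$, and that the content term applies standard classifier-free guidance with scale $\omega$. The function and argument names are hypothetical, and the three inputs stand for model outputs computed elsewhere.

```python
import numpy as np

def guided_prediction(mu_video_cond, mu_video_uncond, mu_image_uncond,
                      omega, omega_t):
    """Combine three predictions for the same noisy latent.

    mu_video_cond   : video model, conditioned on content c
    mu_video_uncond : video model, content dropped (unconditional in c)
    mu_image_uncond : image model applied per frame, content dropped
    omega           : content guidance scale
    omega_t         : temporal guidance scale
    """
    temporal_direction = mu_video_uncond - mu_image_uncond
    content_direction = mu_video_cond - mu_video_uncond
    return mu_image_uncond + omega_t * temporal_direction + omega * content_direction

rng = np.random.default_rng(0)
shape = (8, 4, 16, 16)  # frames, latent channels, height, width (illustrative)
mu_vc, mu_vu, mu_iu = (rng.standard_normal(shape) for _ in range(3))

plain_video = guided_prediction(mu_vc, mu_vu, mu_iu, omega=1.0, omega_t=1.0)
per_frame = guided_prediction(mu_vc, mu_vu, mu_iu, omega=0.0, omega_t=0.0)
```

Setting both scales to one recovers the conditional video prediction, and setting both to zero falls back to the per-frame image prediction; values of `omega_t` above one push further along the direction from the image model towards the video model.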

3.4 Optimization

We train on an internal dataset of 240M images and a custom dataset of 6.4M video clips. We use image batches of size 9216 with resolutions of $320\times 320$, $384\times 320$ and $448\times 256$, as well as the same resolutions with flipped aspect ratios. We sample image batches with a probability of 12.5%. For the main training, we use video batches containing 8 frames sampled four frames apart with a resolution of $448\times 256$ and a total video batch size of 1152.
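The batch mixing and frame sampling described above can be sketched as follows. Only the 12.5% image-batch probability, the 8 frames per clip, and the stride of four frames come from the text; the function names and the uniform choice of the clip start are assumptions made for illustration.

```python
import numpy as np

P_IMAGE_BATCH = 0.125  # probability of drawing an image batch
NUM_FRAMES = 8         # frames per training clip
FRAME_STRIDE = 4       # frames are sampled four frames apart

def sample_batch_kind(rng):
    """Decide whether the next training batch holds images or videos."""
    return "image" if rng.random() < P_IMAGE_BATCH else "video"

def clip_frame_indices(start, num_frames=NUM_FRAMES, stride=FRAME_STRIDE):
    """Indices of the frames that make up one training clip."""
    return start + stride * np.arange(num_frames)

def random_clip_start(rng, video_length):
    """Uniformly pick a start so the whole clip fits (assumed policy)."""
    span = (NUM_FRAMES - 1) * FRAME_STRIDE + 1  # source frames covered by a clip
    if video_length < span:
        raise ValueError("video too short for one clip")
    return int(rng.integers(0, video_length - span + 1))

rng = np.random.default_rng(0)
kinds = [sample_batch_kind(rng) for _ in range(20000)]
image_fraction = kinds.count("image") / len(kinds)
example_indices = clip_frame_indices(random_clip_start(rng, video_length=120))
```

Note that each clip spans 29 source frames under these settings, so shorter videos cannot contribute a full clip.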

We train our model in multiple stages. First, we initialize model weights based on a pretrained text-conditional latent diffusion model (https://github.com/runwayml/stable-diffusion). We change the conditioning from CLIP text embeddings to CLIP image embeddings and fine-tune for 15k steps on images only. Afterwards, we introduce temporal connections as described in Sec. 3.2 and train jointly on images and videos for 75k steps. We then add conditioning on structure $s$ with $t_{s}\equiv 0$ fixed and train for 25k steps. Finally, we resume training with $t_{s}$ sampled uniformly between $0$ and $7$ for another 10k steps.

Results

To evaluate our approach, we use videos from DAVIS and various stock footage. To automatically create edit prompts, we first run a captioning model to obtain a description of the original video content. We then use GPT-3 to generate edited prompts.

We demonstrate that our approach performs well on a number of diverse inputs (see Fig. 5). Our method handles static shots (first row) as well as shaky camera motion from selfie videos (second row) without any explicit tracking of the input videos. We also see that it handles a large variety of footage such as landscapes and close-ups. Our approach is not limited to a specific domain of subjects thanks to its general structure representation based on depth estimates. The generalization obtained from training simultaneously on large-scale image and video datasets enables many editing capabilities, including changes to animation styles such as anime (first row) or claymation (second row), changes in the scene environment, e.g. changing day to sunset (third row) or summer to winter (fourth row), as well as various changes to characters in a scene, e.g. turning a hiker into an alien (fifth row) or turning a bear in nature into a space bear walking through the stars (sixth row).

Using content representations through CLIP image embeddings allows users to specify content through images. One particular example application is character replacement, as shown in Fig. 9. We demonstrate this application using a set of six videos. For every video in the set, we re-synthesize it five times, each time providing a single content image taken from another video in the set. We can retain content characteristics with $t_{s}=3$ despite large differences in their pose and shape.

Lastly, we are given a great deal of flexibility during inference due to our application of versatile diffusion models. We illustrate the use of masked video editing in Fig. 8, where our goal is to have the model predict everything outside the masked area(s) while retaining the original content inside the masked area. Notably, this technique resembles approaches for inpainting with diffusion models. In Sec. 4.3, we also evaluate the ability of our approach to control other characteristics such as temporal consistency and adherence to the input structure.

4.2 User Study

Text-conditioned video-to-video translation is a nascent area of computer vision, and thus we find only a limited number of methods to compare against. We benchmark against Text2Live, a recent approach for text-guided video editing that employs layered neural atlases. As a baseline, we compare against SDEdit in two ways: per-frame generated results and a first-frame result propagated by a few-shot video stylization method (IVS). We also include two depth-based versions of Stable Diffusion: one trained with depth-conditioning and one that retains past results based on depth estimates. We also include an ablation: applying SDEdit to our video model trained without conditioning on a structure representation (ours, $\sim s$).

We judge the success of our method qualitatively based on a user study. We run the user study using Amazon Mechanical Turk (AMT) on an evaluation set of 35 representative video editing prompts. For each example, we ask 5 annotators to compare faithfulness to the video editing prompt (“Which video better represents the provided edited caption?”) between a baseline and our method, presented in random order, and use a majority vote for the final result.
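The aggregation of annotator votes can be illustrated with a few lines of Python. The vote labels, function names, and example data are made up for illustration and are not results from the study.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by most annotators for one comparison.

    With an odd number of annotators and two options there are no ties.
    """
    return Counter(votes).most_common(1)[0][0]

def preference_rate(all_votes, ours="ours"):
    """Fraction of comparisons whose majority vote favours our method."""
    winners = [majority_vote(votes) for votes in all_votes]
    return sum(winner == ours for winner in winners) / len(winners)

# Illustrative votes for four comparisons, five annotators each (not real data).
example_votes = [
    ["ours", "ours", "baseline", "ours", "baseline"],
    ["baseline", "baseline", "ours", "baseline", "ours"],
    ["ours", "ours", "ours", "ours", "ours"],
    ["ours", "baseline", "ours", "baseline", "ours"],
]
example_rate = preference_rate(example_votes)
```

Using an odd number of annotators per example is what guarantees that a two-way comparison always yields a strict majority.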

The results can be found in Fig. 7. Across all compared methods, results from our approach are preferred roughly 3 out of 4 times. A visual comparison among the methods can be found in Fig. S13. We observe that SDEdit is quite sensitive to the editing strength. Low values often do not achieve the desired editing effect and high values change the structure of the input, e.g. in Fig. S13 the elephant looks in another direction after the edit. While the use of a fixed seed is able to keep the overall color of outputs consistent across frames, both style and structure can change in unnatural ways between frames as their relationship is not modeled by image-based approaches. Overall, we observe that deforum behaves very similarly. Propagation of SDEdit outputs with few-shot video stylization leads to more consistent results, but often introduces propagation artifacts, especially in case of large camera or subject movements. Depth-SD produces accurate, structure-preserving edits on individual frames, but without modeling temporal relationships, frames are inconsistent across time.

The quality of Text2Live outputs varies considerably. Due to its reliance on Layered Neural Atlases, the outputs tend to be temporally smooth, but it often struggles to perform edits that represent the edit prompt accurately. A direct comparison is difficult as Text2Live requires input masks and edit prompts for foreground and background. In addition, computing a neural atlas takes about 10 hours whereas our approach requires approximately a minute.

4.3 Quantitative Evaluation

We quantify trade-offs between frame consistency and prompt consistency with the following two metrics.

Frame consistency. We compute CLIP image embeddings on all frames of output videos and report the average cosine similarity between all pairs of consecutive frames.

Prompt consistency. We compute CLIP image embeddings on all frames of output videos and the CLIP text embedding of the edit prompt. We report average cosine similarity between text and image embedding over all frames.
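Given precomputed embeddings, both metrics reduce to averaged cosine similarities. The sketch below assumes the CLIP embeddings are already available as NumPy arrays (one row per frame); the function names are hypothetical and the embedding extraction itself is not shown.

```python
import numpy as np

def _normalize(v):
    """Scale vectors along the last axis to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def frame_consistency(frame_emb):
    """Mean cosine similarity between embeddings of consecutive frames.

    frame_emb: array of shape (num_frames, dim).
    """
    f = _normalize(frame_emb)
    return float(np.mean(np.sum(f[:-1] * f[1:], axis=-1)))

def prompt_consistency(frame_emb, text_emb):
    """Mean cosine similarity between each frame embedding and the text embedding.

    frame_emb: array of shape (num_frames, dim); text_emb: array of shape (dim,).
    """
    f = _normalize(frame_emb)
    t = _normalize(text_emb)
    return float(np.mean(f @ t))

# Toy embeddings: four identical frames, and a prompt orthogonal to them.
toy_frames = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
toy_prompt = np.array([0.0, 2.0, 0.0])
toy_frame_score = frame_consistency(toy_frames)
toy_prompt_score = prompt_consistency(toy_frames, toy_prompt)
```

In the toy example identical frames give a frame consistency of one, and a prompt embedding orthogonal to every frame gives a prompt consistency of zero, which marks the two extremes of each score for non-negative similarities.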

Fig. 6 shows the results of each model using our frame consistency and prompt consistency metrics. Our model tends to outperform the baseline models in both aspects (placed higher in the upper-right quadrant of the graph). We also notice a slight tradeoff with increasing the strength parameters in the baseline models: larger strength scales imply higher prompt consistency at the cost of lower frame consistency. Increasing the temporal scale ($\omega_{t}$) of our model results in higher frame consistency but lower prompt consistency. We also observe that an increased structure scale ($t_{s}$) results in higher prompt consistency as the content becomes less determined by the input structure.

4.4 Customization

Customization of pretrained image synthesis models allows users to generate images of custom content, such as people or image styles, based on a small training dataset for finetuning. To evaluate customization of our depth-conditioned latent video diffusion model, we finetune it on a set of 15-30 images and produce novel content containing the desired subject. During finetuning, half of the batch elements are of the custom subject and the other half are of the original training dataset to avoid overfitting.

Fig. 10 shows an example with different numbers of customization steps as well as different levels of structure adherence $t_{s}$. We observe that customization improves fidelity to the style and appearance of the character, such that, in combination with higher values for $t_{s}$, accurate animations are possible despite using a driving video of a person with different characteristics.

Conclusion

Our latent video diffusion model synthesizes new videos given structure and content information. We ensure structural consistency by conditioning on depth estimates while content is controlled with images or natural language. Temporally stable results are achieved with additional temporal connections in the model and joint image and video training. Furthermore, a novel guidance method, inspired by classifier-free guidance, allows for user control over temporal consistency in outputs. Through training on depth maps with varying degrees of fidelity, we expose the ability to adjust the level of structure preservation, which proves especially useful for model customization. Our quantitative evaluation and user study show that our method is highly preferred over related approaches. Future work could investigate other conditioning data, such as facial landmarks and pose estimates, and additional 3D priors to improve stability of generated results. We do not intend for the model to be used for harmful purposes but realize the risks and hope that further work is aimed at combating abuse of generative models.