Dreamix: Video Diffusion Models are General Video Editors

Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen

Introduction

Recent advancements in generative models and multimodal vision-language models have paved the way to large-scale text-to-image models capable of unprecedented generation realism and diversity. These models have ushered in a new era of creativity, applications, and research efforts. Although these models offer new creative processes, they are limited to synthesizing new images rather than editing existing ones. To bridge this gap, intuitive text-based image editing methods offer text-based editing of generated and real images while maintaining some of their original attributes. Similarly to images, text-to-video models have recently been proposed, but there are currently very few methods using them for video editing.

In text-guided video editing, the user provides an input video and a text prompt which describes the desired attributes of the resulting video (Fig. 1). The objectives are three-fold: i) alignment: the edited video should conform with the input text prompt; ii) fidelity: the edited video should preserve the content of the original input; iii) quality: the edited video should be of high quality. Video editing is more challenging than standard image editing, as it requires synthesizing new motion, not merely modifying visual appearance. It also requires temporal consistency. As a result, applying image-level editing methods (e.g., SDEdit or Prompt-to-Prompt) sequentially on the video frames is insufficient.

We present a new method, Dreamix, to adapt a text-conditioned video diffusion model (VDM) for video editing, in a manner inspired by UniTune. The core of our method is enabling a text-conditioned VDM to maintain high fidelity to an input video via two main ideas. First, instead of using pure noise as initialization for the model, we use a degraded version of the original video, keeping only low spatio-temporal information by downscaling it and adding noise. Second, we further improve the fidelity to the original video by finetuning the generation model on the original video. Finetuning ensures the model has knowledge of the high-resolution attributes of the original video. Naive finetuning on the input video results in relatively low motion editability, as the model learns to prefer the original motion instead of following the text prompt. We propose a novel, mixed finetuning approach, in which the VDMs are also finetuned on the collection of individual frames of the input video while discarding their temporal order. Technically, this is achieved by masking the temporal attention. Mixed finetuning significantly improves the quality of motion edits.

As a further contribution, we leverage our video editing model to propose a new framework for image animation (see Fig. 2). This has several applications including: animating the objects and background in an image, creating dynamic camera motion, etc. We do this by simple image processing operations, e.g. frame replication or geometric image transformation, to create a coarse video. We then edit it with our Dreamix video editor. We also use our novel finetuning approach for subject-driven video generation, i.e. a video version of Dreambooth. We perform an extensive qualitative study and a human evaluation, showcasing the remarkable abilities of our method. We compare our method against the state-of-the-art baselines, demonstrating superior results. To summarize, our main contributions are:

Proposing the first method for general text-based appearance and motion editing of real-world videos.

Proposing a novel mixed finetuning model that significantly improves the quality of motion edits.

Presenting a new framework for text-guided image animation, by applying our video editor method on top of simple image preprocessing operations.

Demonstrating subject-driven video generation from a collection of images, leveraging our novel finetuning method.

Related Work

Deep diffusion models recently emerged as a powerful new paradigm for image generation, and have their roots in score-matching. They outperform the previous state-of-the-art approach, generative adversarial networks (GANs). While they have multiple formulations, EDM showed they are equivalent. Outstanding progress was made in text-to-image generation, where new images are sampled conditioned on an input text prompt. Extending diffusion models to video generation is a challenging computational and algorithmic task. Early work include and text-to-video extensions by . Another line of work extends synthesis to various image reconstruction tasks , extracts confidence intervals for reconstruction tasks.

2.2 Diffusion Models for Editing

Image editing with generative models has been studied extensively; in past years, many of the models were based on GANs. Editing methods have recently adopted diffusion models. Several works proposed to use text-to-image diffusion models for editing rather than text-conditioned synthesis. SDEdit proposed to add targeted noise and other corruptions to an input image, and then use diffusion models for reversing the process. It can perform significant image edits, while losing some fidelity to the original image. Prompt-to-Prompt (and later Plug-and-Play and ) perform semantic edits by mixing activations extracted with the original and target prompts. For InstructPix2Pix, this mixing is only needed for constructing the training dataset, not at test time. Other works (e.g. ) use finetuning and optimization to allow for personalization of the model, learning a special token describing the content. UniTune and Imagic finetune on a single image, allowing better editability while maintaining good fidelity. However, the above methods are image-centric and do not take temporal information into account. Text2Live allows some texture-based video editing but is not diffusion-based and cannot edit motion. A concurrent paper, Tune-a-Video, performs video editing by inflating a text-to-image model to learn temporal consistency. Despite its promising results, it uses a text-to-image backbone that can edit video appearance but not motion. Its results are also not fully temporally consistent. In contrast, our method uses a text-to-video backbone, enabling motion editing while maintaining video smoothness.

Background: Video Diffusion Models

Denoising Model Training. Diffusion models rely on a deep denoising neural network $D_{\theta}$. Let us denote the groundtruth video as $v$, an i.i.d. Gaussian noise tensor of the same dimensions as the video as $\epsilon\sim N(0,\textbf{I})$, and the noise level at time $s$ as $\sigma_{s}$. The noisy video is given by $z_{s}=\gamma_{s}v+\sigma_{s}\epsilon$, where $\gamma_{s}=\sqrt{1-\sigma^{2}_{s}}$. Furthermore, let us denote a conditioning text prompt as $t$ and a conditioning video as $c$ (for super-resolution, $c$ is a low-resolution version of $v$). The objective of the denoising network $D_{\theta}$ is to recover the groundtruth video $v$ given the noisy input video $z_{s}$, the time $s$, the prompt $t$ and the conditioning video $c$. The model is trained on a (typically very large) training corpus $\mathcal{V}$ consisting of pairs of videos $v$ and text prompts $t$. The optimization objective is:

$$\theta^{*}=\arg\min_{\theta}\;\mathbb{E}_{(v,t)\in\mathcal{V},\,\epsilon\sim N(0,\textbf{I}),\,s\sim U(0,1)}\left\|D_{\theta}(z_{s},s,t,c)-v\right\|^{2}$$

Sampling from Diffusion Models. The key challenge in diffusion models is to use the denoiser network $D_{\theta}$ to sample from the distribution of videos conditioned on the text prompt $t$ and conditioning video $c$, $P(v|t,c)$. While the derivation of such a sampling rule is non-trivial (see e.g. ), the implementation of such sampling is relatively simple in practice. We follow in using stochastic DDIM sampling. At a heuristic level, at each step, we first use the denoiser network to estimate the noise. We then remove a fraction of the estimated noise and finally add randomly generated Gaussian noise, with magnitude corresponding to half of the removed noise.
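To make the heuristic description concrete, here is a minimal sketch of one DDIM-style update in the parametrization $z_{s}=\gamma_{s}v+\sigma_{s}\epsilon$ used above. It is an illustration under assumptions, not the sampler used in the paper: the function name `ddim_step`, the flat-list representation of a video, and the free parameter `eta` (the standard deviation of the freshly injected noise) are all hypothetical, and the specific "half of the removed noise" rule is not reproduced.

```python
import math
import random


def ddim_step(z, v_hat, sigma_from, sigma_to, eta, rng):
    """Move a noisy sample from noise level sigma_from down to sigma_to.

    z      -- current noisy sample z_s, as a flat list of floats
    v_hat  -- the denoiser's estimate of the clean video (same length)
    eta    -- std of freshly injected noise (0 gives a deterministic step)
    """
    if not 0.0 <= eta <= sigma_to < sigma_from <= 1.0:
        raise ValueError("need 0 <= eta <= sigma_to < sigma_from <= 1")
    gamma_from = math.sqrt(1.0 - sigma_from ** 2)
    gamma_to = math.sqrt(1.0 - sigma_to ** 2)
    # Part of the estimated noise that is kept, so total noise std is sigma_to.
    kept = math.sqrt(sigma_to ** 2 - eta ** 2)
    out = []
    for z_i, v_i in zip(z, v_hat):
        eps_hat = (z_i - gamma_from * v_i) / sigma_from  # estimated noise
        out.append(gamma_to * v_i + kept * eps_hat + eta * rng.gauss(0.0, 1.0))
    return out


# Toy check with a perfect "denoiser" that already knows the clean signal.
rng = random.Random(0)
v = [0.2, -0.5, 0.9]
sigma = 0.8
gamma = math.sqrt(1.0 - sigma ** 2)
z = [gamma * x + sigma * rng.gauss(0.0, 1.0) for x in v]
restored = ddim_step(z, v, sigma_from=sigma, sigma_to=0.0, eta=0.0, rng=rng)
print(restored == v)  # True
```

With a learned denoiser, `v_hat` would be recomputed from the current sample at every step, and a schedule of decreasing noise levels would be traversed until the noise level reaches zero.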

Cascaded Video Diffusion Models. Training high-resolution text-to-video models is very challenging due to the high computational complexity. Several diffusion models overcome this using cascaded architectures. We use Imagen-Video, which consists of a cascade of $7$ models. The base model maps the input text prompt into a $5$-second video of $24\times 40\times 16$ frames. It is then followed by $3$ spatial super-resolution models and $3$ temporal super-resolution models. For implementation details, see Appendix A.

General Editing by Video Diffusion Models

We propose a new method for video editing using text-guided video diffusion models. We extend it to image animation in Sec. 5.

We wish to edit an input video using the guidance of a text prompt $t$ describing the video after the edit. In order to do so, we leverage the power of a cascade of VDMs. The key idea is to first corrupt the video by downsampling followed by adding noise. We then apply the sampling process of the cascaded diffusion models from the time step corresponding to the noise level, conditioned on $t$, which upscales the video to the final spatio-temporal resolution. The effect is that the VDM will use the low-resolution details provided by the degraded input video, but synthesize new high spatio-temporal resolution information using the text prompt guidance. While this procedure is essentially a text-guided version of SDEdit, there are some video-specific technical challenges that we describe below. Note that this by itself does not result in sufficiently high-fidelity video editing. We present a novel finetuning objective for mitigating this issue in Sec. 4.2.

Input Video Degradation. We downsample the input video to the resolution of the base model ( $16$ frames of $24\times 40$ ). We then add i.i.d Gaussian noise with variance $\sigma_{s}^{2}$ to further corrupt the input video. The noise strength is equivalent to time $s$ in the diffusion process of the base text-to-video model. For $s=0$ , no noise is added, while for $s=1$ , the video is replaced by pure Gaussian noise. Note that even when no noise is added, the input video is highly corrupted due to the extreme downsampling ratio. For the non-finetuned base model, values of $s\in[0.4,0.85]$ typically worked best.
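The degradation step can be sketched in a few lines. This is a toy illustration under assumptions: the video is a nested list `video[t][y][x]` of grayscale floats, downsampling is done by average pooling (the paper does not specify the resampling filter), and the noise level `sigma` is passed directly because the schedule mapping the time $s$ to $\sigma_{s}$ is not given here. The names `downsample` and `degrade` are hypothetical.

```python
import math
import random


def downsample(video, t_factor, s_factor):
    """Average-pool video[t][y][x] by t_factor in time and s_factor in space."""
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    out = []
    for t0 in range(0, n_t - t_factor + 1, t_factor):
        frame = []
        for y0 in range(0, n_y - s_factor + 1, s_factor):
            row = []
            for x0 in range(0, n_x - s_factor + 1, s_factor):
                block = [video[t][y][x]
                         for t in range(t0, t0 + t_factor)
                         for y in range(y0, y0 + s_factor)
                         for x in range(x0, x0 + s_factor)]
                row.append(sum(block) / len(block))
            frame.append(row)
        out.append(frame)
    return out


def degrade(video, t_factor, s_factor, sigma, rng):
    """Downsample, then form z = gamma * v + sigma * eps, gamma = sqrt(1 - sigma^2)."""
    gamma = math.sqrt(1.0 - sigma ** 2)
    small = downsample(video, t_factor, s_factor)
    return [[[gamma * p + sigma * rng.gauss(0.0, 1.0) for p in row]
             for row in frame] for frame in small]


# Toy input: 4 frames of 4x4 pixels (real inputs are far larger).
video = [[[float(t + y + x) for x in range(4)] for y in range(4)] for t in range(4)]
coarse = degrade(video, t_factor=2, s_factor=2, sigma=0.6, rng=random.Random(0))
print(len(coarse), len(coarse[0]), len(coarse[0][0]))  # 2 2 2
```

At `sigma=0.0` the output is just the downsampled video, and at `sigma=1.0` the signal term vanishes and only Gaussian noise remains, matching the two extremes described above.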

Text-Guided Corruption Inversion. We can now use the cascaded VDMs to map the corrupted, low-resolution video into a high-resolution video that aligns with the text. The core idea here is that given a noisy, very low spatio-temporal resolution video, there are many perfectly feasible, high-resolution videos that correspond to it. We use the target text prompt $t$ to select the feasible outputs that not only correspond to the low-resolution version of the original video but are also aligned with the edits desired by the user. The base model starts with the corrupted video, which has the same noise level as the diffusion process at time $s$. We use the model to reverse the diffusion process up to time $0$. We then upscale the video through the entire cascade of super-resolution models (see Appendix A). All models are conditioned on the prompt $t$.

4.2 Mixed Video-Image Finetuning

The naive method presented in Sec. 4.1 relies on a corrupted version of the input video, which does not include enough information to preserve high-resolution details such as fine textures or object identity. We tackle this issue by adding a preliminary stage of finetuning the model on the input video $v$. Note that this only needs to be done once for the video, which can then be edited by many prompts without further finetuning. We would like the model to separately update its prior on both the appearance and the motion of the input video. Our approach therefore treats the input video both as a single video clip and as an unordered set of $M$ frames, denoted by $u=\{x_{1},x_{2},..,x_{M}\}$. We use a rare string $t^{*}$ as the text prompt, following . We finetune the denoising models by a combination of two objectives. The first objective updates the model prior on both motion and appearance by requiring it to exactly reconstruct the input video $v$ given its noisy versions $z_{s}$:

$$\mathcal{L}^{vid}_{\theta}(v)=\mathbb{E}_{\epsilon\sim N(0,\textbf{I}),\,s\sim U(0,1)}\left\|D_{\theta}(z_{s},s,t^{*},c)-v\right\|^{2}$$

Additionally, we train the model to reconstruct each of the frames individually given their noisy version. This enhances the appearance prior of the model, separately from the motion. Technically, the model is trained on a sequence of frames $u$ by replacing the temporal attention layers by trivial fixed masks ensuring the model only pays attention within each frame, and also by masking the residual temporal convolution blocks. We denote the attention-masked denoising model as $D^{a}_{\theta}$. With $z_{s}$ now denoting the noisy version of $u$, the masked attention objective is given by:

$$\mathcal{L}^{frame}_{\theta}(u)=\mathbb{E}_{\epsilon\sim N(0,\textbf{I}),\,s\sim U(0,1)}\left\|D^{a}_{\theta}(z_{s},s,t^{*},c)-u\right\|^{2}$$

We train the objectives jointly and denote this mixed finetuning:

$$\theta=\arg\min_{\theta^{\prime}}\;\alpha\,\mathcal{L}^{vid}_{\theta^{\prime}}(v)+(1-\alpha)\,\mathcal{L}^{frame}_{\theta^{\prime}}(u)$$

where $\alpha$ is a hyperparameter weighting the two objectives (see Fig. 3). Training on a single video or a handful of frames can easily lead to overfitting, reducing the editing ability of the original model. To mitigate overfitting, we use a small number of finetuning iterations and a low learning rate (see Appendix A).
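The weighting between the video objective and the frame objective can be illustrated with a toy sketch. Everything here is an assumption made for illustration: frames are flat lists of floats, the two "denoisers" are simple stand-in callables rather than neural networks, the text prompt $t^{*}$ and conditioning video $c$ are omitted, and a single noise level is drawn per call instead of taking the full expectation. The function names are hypothetical.

```python
import math
import random


def add_noise(x, sigma, rng):
    """Return gamma * x + sigma * eps with gamma = sqrt(1 - sigma^2)."""
    gamma = math.sqrt(1.0 - sigma ** 2)
    return [gamma * p + sigma * rng.gauss(0.0, 1.0) for p in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def mixed_finetuning_loss(video, denoise_video, denoise_frame, alpha, sigma, rng):
    """One Monte Carlo sample of alpha * L_vid + (1 - alpha) * L_frame.

    denoise_video -- callable(noisy_frames, sigma); may use temporal context
    denoise_frame -- callable(noisy_frame, sigma); sees one frame at a time,
                     standing in for the attention-masked model
    """
    noisy_video = [add_noise(f, sigma, rng) for f in video]
    recon = denoise_video(noisy_video, sigma)
    loss_vid = sum(mse(r, f) for r, f in zip(recon, video)) / len(video)

    frames = list(video)
    rng.shuffle(frames)  # the frame objective discards temporal order
    loss_frame = sum(mse(denoise_frame(add_noise(f, sigma, rng), sigma), f)
                     for f in frames) / len(frames)
    total = alpha * loss_vid + (1.0 - alpha) * loss_frame
    return total, loss_vid, loss_frame


def temporal_mean_denoiser(noisy_frames, sigma):
    """Toy video denoiser: average over time (uses temporal context)."""
    n = len(noisy_frames)
    mean = [sum(f[i] for f in noisy_frames) / n for i in range(len(noisy_frames[0]))]
    return [list(mean) for _ in noisy_frames]


def identity_denoiser(noisy_frame, sigma):
    """Toy per-frame denoiser: returns its input unchanged."""
    return list(noisy_frame)


video = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
total, l_vid, l_frame = mixed_finetuning_loss(
    video, temporal_mean_denoiser, identity_denoiser,
    alpha=0.35, sigma=0.5, rng=random.Random(0))
print(total, l_vid, l_frame)
```

Setting `alpha=1.0` recovers video-only finetuning and `alpha=0.0` recovers frame-only finetuning, the two extremes listed in Appendix A.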

4.3 Hyperparameters

Our method has several hyperparameters. At inference time, we have the noise scale $s\in[0,1]$, where $s=1$ corresponds to standard sampling without using the degraded input video. For finetuning, we have the number of finetuning steps $FT_{steps}$, the learning rate $lr$, and the mixing weight $\alpha$ between the video and frames finetuning objectives (see Sec. 4.2). See Fig. 7 for a qualitative analysis of hyperparameter impact, and Sec. 6.3 for a quantitative analysis. Additional implementation details may be found in Appendix A.

Applications of Dreamix

The method proposed in Sec. 4 can naturally be used to edit motion and appearance in real-world videos. In this section, we propose a framework for using our Dreamix video editor for general, text-conditioned image-to-video editing; see Fig. 4 for an overview.

Dreamix for Single Images. Provided our general video editing method, Dreamix, we now propose a framework for image animation conditioned on a text prompt. The idea is to transform the image or a set of images into a coarse, corrupted video and edit it using Dreamix. For example, given a single image $x$ as input, we can transform it to a video by replicating it $16$ times to form a static video $v=[x,x,x...x]$ . We can then edit its appearance and motion using Dreamix conditioned on a text prompt. Here, we do not wish to incorporate the motion of the input video (as it is static and meaningless) and therefore use only the masked temporal attention finetuning ( $\alpha=0$ ). We can further control the output video, by simulating camera motion, such as panning and zoom. We perform this by sampling a smooth sequence of $16$ perspective transformations $T_{1},T_{2}..T_{16}$ and apply each on the original image. When the perspective requires pixels outside the input image, we simply outpaint them using reflection padding. We concatenate the sequence of transformed images into a low quality input video $v=[T_{1}(x),T_{2}(x)..T_{16}(x)]$ . While this does not result in realistic video, Dreamix can transform it into a high-quality edited video.
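The two ways of building a coarse input video from a single image can be sketched as follows. This is a simplified illustration, not the paper's preprocessing code: the image is a nested list `image[y][x]`, the camera motion is restricted to a centered zoom-out (a special case of a perspective transformation), sampling is nearest-neighbor, and the names `reflect`, `replicate`, and `zoom_out_video` as well as the `max_scale` parameter are hypothetical.

```python
def reflect(i, n):
    """Map any integer index into [0, n) by mirroring at the image borders."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def replicate(image, num_frames=16):
    """Static video v = [x, x, ..., x]."""
    return [[list(row) for row in image] for _ in range(num_frames)]


def zoom_out_video(image, num_frames=16, max_scale=1.5):
    """Smooth zoom-out: frame k samples the image at scale 1 -> max_scale.

    Source pixels that fall outside the image are filled by reflection
    padding. Requires num_frames >= 2.
    """
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2, (width - 1) / 2
    video = []
    for k in range(num_frames):
        scale = 1.0 + (max_scale - 1.0) * k / (num_frames - 1)
        frame = []
        for y in range(height):
            row = []
            for x in range(width):
                src_y = round(cy + (y - cy) * scale)
                src_x = round(cx + (x - cx) * scale)
                row.append(image[reflect(src_y, height)][reflect(src_x, width)])
            frame.append(row)
        video.append(frame)
    return video


image = [[r * 4 + c for c in range(4)] for r in range(4)]
static = replicate(image)
zoomed = zoom_out_video(image)
print(len(static), len(zoomed), zoomed[0] == image)  # 16 16 True
```

The resulting frame sequence is deliberately crude; in the framework described above it only serves as the coarse input that the video editor then refines.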

Dreamix for subject-driven video generation. We propose to use Dreamix for text-conditioned video generation given an image collection. The input to our method is a set of images, each containing the subject of interest. This can potentially also use different frames from the same video, as long as they show the same subject. Higher diversity of viewing angles and backgrounds is beneficial for the performance of the method. We then use our novel finetuning method from Sec. 4.2, where we only use the masked attention finetuning ( $\alpha=0$ ). After finetuning, we use the text-to-video model without a conditioning video, but rather only using a text prompt (which includes the special token $t^{*}$ ).

Experiments

We showcase the results of Dreamix, demonstrating unprecedented video editing and image animation abilities.

Video Editing. In Fig. 1, we change the motion to dancing and the appearance from monkey to bear, keeping the coarse attributes of the video fixed. Dreamix can also generate new motion that does not necessarily align with the input video (puppy in Fig. 5, orangutan in Fig. 13), and can control camera movements (zoom-out example in Fig. 14). Dreamix can generate smooth visual modifications that align with the temporal information in the input video. This includes adding effects (field in Fig. 10, saxophone in Fig. 14), adding objects (hat in Fig. 10 and skateboard in Fig. 11) or replacing them (robot in Fig. 10), and changing the background (truck in Fig. 14).

Image-driven Videos. When the input is a single image, Dreamix can use its video prior to add new moving objects (camel in Fig. 9), inject motion into the input (turtle in Fig. 2 and coffee in Fig. 6), or create new camera movements (buffalo in Fig. 6). Our method is unique in being able to do this for general, real-world images.

Subject-driven Video Generation. Dreamix can take an image collection showing the same subject and generate new videos with this subject in motion. This is unique, as previous approaches could only do this for images. We demonstrate this on a range of subjects and actions, including the weight-lifting toy fireman in Fig. 2 and the walking and drinking bear in Fig. 6 and Fig. 9. It can place the subjects in new surroundings, e.g., moving the caterpillar to a leaf and even under a magnifying glass in Fig. 9.

6.2 Baseline Comparisons

Baselines. We compare our method against two baselines:

Text-to-Video. Directly mapping the text prompt to a video, without conditioning on the input video using Imagen-Video.

Plug-and-Play (PnP). A common approach for video editing is to apply text-to-image editing on each frame individually. We apply PnP (a SoTA method) on each frame independently and concatenate the frames into a video.

Quantitative Comparison. We performed a human-rated evaluation of Dreamix and the baselines on a dataset of $29$ videos taken from YouTube-8M, and $127$ text prompts, across different categories. We used a single hyperparameter set for all methods. Each edited video was rated on a scale of $1$ to $5$ to evaluate its visual quality, its fidelity to the unedited details of the base video and its alignment with the text prompt. We collected $4$ to $6$ ratings for each edited video. The results of the evaluation can be seen in Tab. 2. We also highlight the success rate of each method, where a successful edit is one that received a mean score larger than 2 in all dimensions. We observe that frame-by-frame methods like Plug-and-Play perform poorly in terms of visual quality, as they create flickering effects due to the lack of temporal input. Moreover, Plug-and-Play sometimes ignored the edit altogether, resulting in low alignment and high fidelity. The Text-to-Video baseline ignores the input video, resulting in low fidelity. Our method balances between the three dimensions, resulting in a high success rate.
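The success criterion described above is easy to state in code. The sketch below is illustrative only: the ratings are made-up numbers, not data from the study, and the function names and the dictionary layout are hypothetical.

```python
def is_successful(ratings, threshold=2.0):
    """An edit succeeds if its mean score exceeds the threshold in every dimension.

    ratings -- dict mapping a dimension name to the list of rater scores (1 to 5)
    """
    return all(sum(scores) / len(scores) > threshold for scores in ratings.values())


def success_rate(all_ratings, threshold=2.0):
    """Fraction of edited videos that count as successful."""
    return sum(is_successful(r, threshold) for r in all_ratings) / len(all_ratings)


# Hypothetical ratings for three edited videos, four raters each.
edits = [
    {"quality": [4, 5, 4, 3], "fidelity": [3, 4, 3, 4], "alignment": [4, 4, 5, 3]},
    {"quality": [2, 1, 2, 2], "fidelity": [4, 4, 5, 4], "alignment": [3, 3, 4, 3]},
    {"quality": [3, 3, 4, 3], "fidelity": [2, 2, 2, 2], "alignment": [5, 4, 4, 4]},
]
print([is_successful(e) for e in edits])  # [True, False, False]
```

Note that the comparison is strict, so an edit whose mean is exactly at the threshold in any one dimension (the third example) does not count as successful.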

Qualitative Comparison. Fig. 8 presents an example of a video edited by Dreamix and the two baselines. The text-to-video model achieves low fidelity edits as it is not conditioned on the original video. PnP preserves the scene but lacks consistency between different frames. Dreamix performs well in all three objectives.

6.3 Ablation Study

We conducted a user study comparing our proposed mixed finetuning method (see Sec. 4.2) to two ablations: no finetuning and finetuning on the video only (but not the independent frames). Our dataset contained $29$ videos (each of $5$ seconds) taken from YouTube-8M, and a total of $127$ text prompts. Additional details are provided in Appendix B. The results are presented in Tab. 1. Our main observations are as follows. Motion changes require high editability. Frame-based finetuning typically outperformed video-only finetuning. Denoising without finetuning worked well for style transfer, where finetuning was often detrimental. Preserving fine details in background, color or texture changes required finetuning.

Discussion

In this section, we analyse the limitations of our method, potential ways to address them and future applications.

Hyperparameter Selection. Optimal hyperparameter values e.g., noise strength, can change between prompts. Automating their selection will make our method more user friendly. It can be done by learning a regressor from (input video, prompt) to the optimal hyperparameters. Creating a training set with the optimal hyperparameters per-edit (e.g. as judged by users) is left for future work.

Automatic Evaluation Metrics. In our preliminary study, we found that automatic evaluation metrics (e.g. CLIP Score for alignment) are imperfectly correlated with human preference. Future work on automatic video text-editing metrics should address this limitation. Having effective metrics will also support labeling large datasets for the automatic hyperparameter selection suggested above.

Frequency of Objects in Dataset and Editability. Not all prompt-video pairs yield successful edits (as can be seen in Tab. 2). Being able to determine the successful pairs in advance will speed up the creative editing process. In preliminary work, we found that edits containing objects and actions that frequently occurred in the training dataset resulted in better edits than rarer ones. This suggests that an automatic method for prompt engineering is a promising direction.

Computational Cost. VDMs are computationally expensive. Finetuning our model, which contains billions of parameters, requires large hardware accelerators for around $30$ minutes per video. Speeding it up and lowering the computational cost will allow our method to be used for a larger set of applications.

Future Applications. We expect Dreamix to have many future applications. Several promising ones are: motion interpolation between an image pair, text-guided inpainting and outpainting.

Conclusion

We presented a general approach for text-conditioned editing using video diffusion models. Beyond video editing, we introduced a new framework for image animation. We also applied our method to subject-driven video generation. Extensive experiments demonstrated the unprecedented results of our method.

Social Impact

Our primary aim in this work is to advance research on tools to enable users to animate their personal content. While the development of end-user applications is out of the scope of this work, we recognize both the opportunities and risks that may follow from our contributions. As discussed above, we anticipate multiple possible applications for this work that have the potential to augment and extend creative practices. The personalized component of our approach brings particular promise as it will enable users to better align content with their intent, despite potential biases present in general VDMs. On the other hand, our method carries similar risks as other highly capable media generation approaches. Malicious parties may try to use edited videos to mislead viewers or to engage in targeted harassment. Future research must continue investigating these concerns.

Acknowledgements

We thank Ely Sarig for creating the video, Jay Tenenbaum for the video narration, Amir Hertz for the implementation of our eval baseline, Daniel Cohen-Or, Assaf Zomet, Eyal Segalis, Matan Kalman and Emily Denton for their valuable inputs that helped improve this work.

Appendix A Implementation Details

All of our experiments were performed on Imagen-Video, a pretrained cascaded video diffusion model, with the following components:

a T5-XXL text encoder that computes embeddings from the textual prompt. These embeddings are then used as a condition by all other models.

a base video diffusion model, conditioned on text. It generates videos at $16\times 24\times 40\times 3$ resolution (frames $\times$ height $\times$ width $\times$ channels) at $3$ fps.

$6$ super-resolution video diffusion models, each conditioned on text and on the output video of the previous model. Each model is either spatial (SSR), i.e. upscales resolution, or temporal (TSR), i.e. fills in intermediate frames between the input frames. The order of super-resolution models is TSR (2x), SSR (2x), SSR (4x), TSR (2x), TSR (2x), and SSR (4x). The multiplier in parentheses is the upscaling factor: for the number of output frames (TSR), or for the output height and width in pixels (SSR). The final output video is in $128\times 768\times 1280\times 3$ at $24$ fps.
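As a quick sanity check of the numbers above, the following sketch propagates the base model's output shape through the listed super-resolution stages. The data layout and the function name `apply_cascade` are illustrative choices; the stage order, factors, and base shape are taken from the text.

```python
# Base model output as stated above: 16 frames of 24x40 pixels at 3 fps.
BASE = {"frames": 16, "height": 24, "width": 40, "fps": 3}

# Super-resolution stages in the listed order: (kind, upscaling factor).
CASCADE = [("TSR", 2), ("SSR", 2), ("SSR", 4), ("TSR", 2), ("TSR", 2), ("SSR", 4)]


def apply_cascade(base, cascade):
    """Return the video shape after the base model and after every SR stage."""
    shape = dict(base)
    trace = [dict(shape)]
    for kind, factor in cascade:
        if kind == "TSR":    # temporal: more frames, higher frame rate
            shape["frames"] *= factor
            shape["fps"] *= factor
        elif kind == "SSR":  # spatial: larger height and width
            shape["height"] *= factor
            shape["width"] *= factor
        else:
            raise ValueError(f"unknown stage kind: {kind}")
        trace.append(dict(shape))
    return trace


final = apply_cascade(BASE, CASCADE)[-1]
print(final)  # {'frames': 128, 'height': 768, 'width': 1280, 'fps': 24}
```

The three TSR stages multiply the frame count and frame rate by $8$, and the three SSR stages multiply each spatial dimension by $32$, which reproduces the stated final resolution.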

Note that the diffusion models are pretrained on both videos and images, with frozen temporal attention and convolution for the latter. Our mixed finetuning approach treats video frames as if they were images.
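One way to picture "treating video frames as if they were images" is a temporal attention layer whose mask only lets each frame attend to itself. The sketch below is a toy illustration under assumptions, not the Imagen-Video implementation: it handles a single spatial location, uses plain scaled dot-product attention without learned projections, and the names `temporal_attention` and `frame_only_mask` are hypothetical.

```python
import math


def frame_only_mask(num_frames):
    """Identity mask: frame i may attend only to frame i."""
    return [[i == j for j in range(num_frames)] for i in range(num_frames)]


def temporal_attention(queries, keys, values, mask):
    """Scaled dot-product attention over the time axis at one spatial location.

    queries, keys, values -- one vector (list of floats) per frame
    mask[i][j]            -- True if frame i is allowed to attend to frame j
    Every row of the mask must allow at least one frame.
    """
    dim = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked entries become 0.0
        norm = sum(weights)
        weights = [w / norm for w in weights]
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
feats = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
masked = temporal_attention(feats, feats, values, frame_only_mask(3))
print(masked == values)  # True
```

With the identity mask, each frame's output depends only on its own value vector, so no information flows between frames and their temporal order becomes irrelevant.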

Distillation. For some of these models, we use a distilled version to allow for faster sampling times. The base model is a distilled model with $64$ sampling steps. The first two SSR models are non-distilled models with $128$ sampling steps (due to finetuning considerations, see below). All other SR models use $8$ sampling steps. All models use classifier-free-guidance weight of 1.0 (meaning that classifier free guidance is turned off).

A.2 Finetuning

To reduce finetuning time, we only finetune the base model and the first 2 SSR models. In our experiments, finetuning the first 2 SSR models using the distilled models (with $8$ sampling steps) did not yield good quality. We therefore use the non-distilled versions of these models for all experiments (including non-finetuned experiments). Good combinations of finetuning hyperparameters are:

$\alpha=1.0$ (video only finetuning), $FT_{steps}=64$

$\alpha=0.35$ (mixed video / video-frame finetuning), $FT_{steps}\in$

$\alpha=0$ (video-frame only finetuning), $FT_{steps}\in$

The learning rate ( $lr$ ) we use in all experiments is $6\cdot 10^{-6}$, much lower than the value used for pretraining the models.

A.3 Sampling

We use a DDIM sampler with stochastic noise correction, following . For the last, highest-resolution SSR model, for capacity reasons, we sample sub-chunks of 32 frames of the lower-resolution input video, and then concatenate all the outputs back into a 128-frame video.
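The chunking scheme for the last SSR stage can be sketched as below. The helper name `sample_in_chunks` and the stand-in `fake_ssr` callable are hypothetical; a real stage would run the diffusion sampler on each chunk, and the sketch assumes chunks are processed independently.

```python
def sample_in_chunks(frames, chunk_size, sample_fn):
    """Apply sample_fn to consecutive chunks of frames and concatenate the results."""
    if len(frames) % chunk_size != 0:
        raise ValueError("number of frames must be a multiple of chunk_size")
    output = []
    for start in range(0, len(frames), chunk_size):
        output.extend(sample_fn(frames[start:start + chunk_size]))
    return output


calls = []


def fake_ssr(chunk):
    """Stand-in for the spatial super-resolution sampler (frame count unchanged)."""
    calls.append(len(chunk))
    return [f"hi-res({frame})" for frame in chunk]


low_res = list(range(128))  # 128 low-resolution frames, represented by their index
high_res = sample_in_chunks(low_res, chunk_size=32, sample_fn=fake_ssr)
print(len(high_res), calls)  # 128 [32, 32, 32, 32]
```

Because the stage is spatial, each chunk keeps its frame count, so four chunks of 32 frames reassemble into the full 128-frame video.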

Noise strength. We got the best results for the following values of noise strength $s$ : for non-finetuned models, $s\in[0.4,0.85]$ and for finetuned models, $s\in[0.95,1.0]$ .

Appendix B Human evaluations details

We performed human evaluations for the baseline comparison and the ablation analysis. Both evaluations were conducted by a panel of $10$ human raters, over a dataset of $29$ videos with $127$ edit prompts. The dataset videos were selected from YouTube-8M and show animals, people performing actions, vehicles, and other objects. The edit prompt categories are detailed in Tab. 1 of the main paper. The video resolution shown to raters was $350\times 200$ .

In the ablation analysis the raters selected the best edited video out of $12$ hyperparameter combinations.

In the baseline comparison, the raters saw the original video alongside an edited video and answered the following questions:

Rate the overall visual quality and smoothness of the edited video.

How well does the edited video match the textual edit description provided?

How well does the edited video preserve unedited details of the original video?

We used a single set of hyperparameters in the baseline eval: $\alpha=0.35;FT_{steps}=300;s=1$ .

Appendix C Image Attribution

Desert - https://unsplash.com/photos/PP8Escz15d8

Fuji mountain - https://unsplash.com/photos/9Qwbfa_RM94

Tree in snow - https://unsplash.com/photos/aQNy0za7x0k

Hut in snow - https://unsplash.com/photos/qV2p17GHKbs

Lake with trees - https://unsplash.com/photos/dIQlgwq6V3Y

Plant - https://unsplash.com/photos/LrPKL7jOldI

Turtle - https://unsplash.com/photos/za9MCg787eI

Yosemite - https://unsplash.com/photos/NRQV-hBF10M

Foggy forest - https://unsplash.com/photos/pKNqyx_v62s

Coffee - https://unsplash.com/photos/SMPe5xfbPT0

Monkey - https://www.pexels.com/video/a-brown-monkey-eating-bread-2436088/