FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, Qifeng Chen

Introduction

Diffusion-based models can generate diverse, high-quality images and videos from text prompts. These generative priors also open up significant opportunities for editing real-world visual content.

Previous and concurrent diffusion-based editing methods mainly work on images. To edit real images, they use deterministic DDIM for image-to-noise inversion; the inverted noise is then gradually denoised into the edited image under the condition of the target prompt. Based on this pipeline, several methods have been proposed that rely on cross-attention guidance, plug-and-play features, and optimization.

Manipulating videos through generative priors, as the image editing methods above do, poses many challenges (Fig. 7). First, there are no publicly available generic text-to-video models. Thus, a framework based on image models can be more valuable than one based on video models, thanks to the various open-sourced image models in the community. However, text-to-image models lack temporal-aware information, e.g., motion and 3D shape understanding. Directly applying image editing methods to video produces obvious flickering. Second, although we can use previous video editing methods via keyframe or atlas editing, these methods still need atlas learning, keyframe selection, and per-prompt tuning. Moreover, while they may work well for attribute and style editing, shape editing is still a big challenge. Finally, as introduced above, current editing methods use DDIM for inversion and then denoise with the new prompt. However, in video inversion, the inverted noise at step $T$ might break the motion and structure of the original video because of error accumulation (Fig. 4 and 9).

In this paper, we propose FateZero, a simple yet effective method for zero-shot video editing: we do not need to train for each target prompt individually, and we do not require a user-specified mask. Different from image editing, video editing needs to keep the edited video temporally consistent, which is not learned by the original text-to-image model. We tackle this problem with two novel designs. First, instead of relying solely on inversion and generation, we store all the self-attention and cross-attention maps at every step of the inversion process. This enables us to subsequently replace the corresponding maps during the denoising steps of the DDIM pipeline. Specifically, we find that these self-attention blocks store better motion information and that the cross-attention can be thresholded into a mask for spatially blending self-attention. This attention blending operation keeps the original structures unchanged. Second, following prior work, we reform the self-attention blocks into spatial-temporal attention blocks to make the appearance more consistent. Powered by these designs, we can directly edit the style and attributes of a real-world video (Fig. 6) using a pre-trained text-to-image model. Also, given a video diffusion model (e.g., a pretrained Tune-A-Video), our method shows better object editing ability (Fig. 5) at test time than simple DDIM inversion. Extensive experiments provide evidence of the advantages of the proposed method for both video and image editing.

Our contributions are summarized as follows:

We present the first framework for temporally consistent zero-shot text-based video editing using a pretrained text-to-image model.

We propose to fuse the attention maps in the inversion process and generation process to preserve the motion and structure consistency during editing.

Our novel Attention Blending Block utilizes the source prompt’s cross-attention map during attention fusion to prevent source semantic leakage and improve the shape-editing capability.

We show extensive applications of our method in video style editing, video local editing, video object replacement, etc.

Related Work

Video Editing. Videos can be edited in several ways. For video stylization, current methods rely on an example as the style guide, and these methods may fail when tracking is lost. Some works process frames individually using image style transfer and then learn to reduce the temporal inconsistency in a post-processing step. However, the style may still be imperfect since style transfer only measures perceptual distance. Several works show better consistency, but only on a specific domain, e.g., portrait video. For local video editing, layer-atlas-based methods show a promising direction by editing the video on a flattened texture map. However, the 2D atlas lacks the 3D motion perception needed to support shape editing, and prompt-specific optimization is required.

A more challenging topic is editing the object shape in a real-world video. Current methods show obvious artifacts even with optimization on generative priors. The stronger prior of diffusion-based models has also drawn the attention of researchers. For example, gen1 trains a conditional model for depth- and text-guided video generation, which can edit the appearance of the generated images on the fly. Dreamix finetunes a stronger diffusion-based video model for editing with stronger generative priors. Both of these methods need private and powerful video diffusion models for editing. Thus, the many existing large-scale fine-tuned text-to-image models cannot be used directly.

Image and Video Generation Models. Image generation is a fundamental and active topic in computer vision. Early works mainly use VAE or GAN to model the distribution of a specific domain. Recent works adopt VQVAE and transformers for image generation. However, due to the difficulties in training these models, they only work well on a specific domain, e.g., faces. In addition, the editing ability of these models is relatively weak since the feature space of a GAN is high-level and the quantized tokens cannot be considered individually. Another type of method focuses on text-to-image generation. DALL-E and CogView train an image generative pre-training transformer (GPT) to generate images from a CLIP text embedding. Recent models benefit from the training stability of diffusion-based models. These models can be scaled with huge datasets and show surprisingly good results on text-to-image generation by integrating large language model conditions. Their latent space also has spatial structure, which provides stronger editing ability than previous GAN-based methods. Generating videos is much more difficult than generating images. Current methods rely on larger cascaded models and datasets. In contrast, magic-video and gen1 initialize the model from a text-to-image model and generate continuous content through extra time-aware layers. Recently, Tune-A-Video over-fits a single video for text-based video generation. After training, the model can generate related motion from similar prompts. However, how to edit real-world content using this model is still unclear. Inspired by image editing methods and Tune-A-Video, our method can edit the style of real-world videos and images using a trained text-to-image model, and it shows better object replacement performance on real videos than the one-shot finetuned video diffusion model with simple DDIM inversion (Fig. 7).

Image Editing in Diffusion Model. Many recent works adopt a trained diffusion model for editing. SDEdit generates content for a new prompt by first adding noise to the image. DiffEdit computes the edit mask from the noise differences between text prompts and then blends the inversion noise into the image generation process. Similar work has also been proposed by Blended Diffusion, which combines the features of each step for image blending. Plug-and-play obtains the inversion noise and applies denoising for feature reconstruction. After that, the self-attention features in editing are directly replaced with those from reconstruction. Pix2pix-Zero edits the image with cross-attention guidance. Prompt-to-Prompt shows that images can be edited by reweighting the cross-attention maps of different prompts. There are also methods that achieve better editing ability via optimization. However, a naive frame-wise application of these image methods to video results in flickering and inconsistency among frames.

Methods

We target zero-shot text-driven video editing (e.g., style, attribute, and shape) without optimization for each target prompt and without a user-provided mask. In Sec. 3.1, we first give the details of latent diffusion and DDIM inversion. After that, we introduce our method, which enables video appearance editing (Sec. 3.2) via pre-trained text-to-image models. Finally, in Sec. 3.3, we discuss a more challenging case that also enables shape-aware video editing using a video diffusion model. Note that the proposed method is a general editing method and can be used with various text-to-image or text-to-video models. In this paper, we mainly use Stable Diffusion and a video generation model based on Stable Diffusion (Tune-A-Video) for their popularity and generalization ability.

Latent Diffusion Models are introduced to diffuse and denoise the latent space of an autoencoder. First, an encoder $\mathcal{E}$ compresses an RGB image $x$ to a low-resolution latent $z=\mathcal{E}(x)$, which can be reconstructed back to an image $\mathcal{D}(z)\approx x$ by a decoder $\mathcal{D}$. Second, a U-Net $\varepsilon_{\theta}$ containing cross-attention and self-attention is trained to remove the artificial noise using the objective:

$$\min_{\theta}\;\mathbb{E}_{z_{0},\,\varepsilon\sim\mathcal{N}(0,I),\,t\sim\text{Uniform}(1,T)}\left\|\varepsilon-\varepsilon_{\theta}(z_{t},t,p)\right\|_{2}^{2}, \tag{1}$$

where $p$ is the embedding of the conditional text prompt and $z_{t}$ is a noisy sample of $z_{0}$ at timestep $t$ .
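To make the roles of $z_{t}$ and the noise target concrete, the following is a minimal NumPy sketch. It assumes the standard forward-noising rule $z_{t}=\sqrt{\alpha_{t}}\,z_{0}+\sqrt{1-\alpha_{t}}\,\varepsilon$, with $\alpha_{t}$ the cumulative noise-schedule parameter. The names `add_noise` and `denoising_loss` are illustrative and do not come from the paper, and the loss averages over elements, which differs from the squared norm only by a constant factor.

```python
import numpy as np

def add_noise(z0, eps, alpha_t):
    """Forward noising (assumed standard form): mix the clean latent with Gaussian noise."""
    return np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * eps

def denoising_loss(eps_pred, eps):
    """Mean squared error between the predicted noise and the noise that was added."""
    return float(np.mean((eps_pred - eps) ** 2))

# Toy usage: a "latent" of 4 values and one noise draw.
rng = np.random.default_rng(0)
z0 = rng.normal(size=4)
eps = rng.normal(size=4)
z_t = add_noise(z0, eps, alpha_t=0.5)
# A real U-Net would predict eps from (z_t, t, p); a zero prediction is a placeholder.
loss = denoising_loss(np.zeros_like(eps), eps)
```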

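DDIM Inversion. During inference, deterministic DDIM sampling is employed to convert a random noise $z_{T}$ to a clean latent $z_{0}$ over a sequence of timesteps $t:T\rightarrow 1$:

$$z_{t-1}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}\,z_{t}+\sqrt{\alpha_{t-1}}\left(\sqrt{\frac{1}{\alpha_{t-1}}-1}-\sqrt{\frac{1}{\alpha_{t}}-1}\right)\varepsilon_{\theta}(z_{t},t,p), \tag{2}$$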

where $\alpha_{t}$ is a parameter for noise scheduling.

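Based on the ODE limit analysis of the diffusion process, DDIM inversion is proposed to map a clean latent $z_{0}$ back to a noised latent $\hat{z}_{T}$ in reversed steps $t:1\rightarrow T$:

$$\hat{z}_{t}=\sqrt{\frac{\alpha_{t}}{\alpha_{t-1}}}\,\hat{z}_{t-1}+\sqrt{\alpha_{t}}\left(\sqrt{\frac{1}{\alpha_{t}}-1}-\sqrt{\frac{1}{\alpha_{t-1}}-1}\right)\varepsilon_{\theta}(\hat{z}_{t-1},t-1,p_{src}). \tag{3}$$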

As a result, the inverted latent $\hat{z}_{T}$ can reconstruct a latent $\hat{z}_{0}(p_{src})=\text{DDIM}(\hat{z}_{T},p_{src})$ similar to the clean latent $z_{0}$ at classifier-free guidance scale $s_{cfg}=1$. Recent image editing methods use a large classifier-free guidance scale $s_{cfg}\gg 1$ to edit the latent as $\hat{z}_{0}(p_{edit})=\text{DDIM}(\hat{z}_{T},p_{edit})$ (second row in Fig. 3(a)), while a reconstruction of $\hat{z}_{0}(p_{src})$ is conducted in parallel to provide attention constraints (first row in Fig. 3(a)).
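The sampling and inversion updates can be sketched on scalars as below. This is a minimal illustration assuming the common closed form of the deterministic DDIM update, with $\alpha_{t}$ taken as the cumulative schedule value. The linear schedule and the toy noise predictor `toy_eps` are hypothetical stand-ins for a real schedule and U-Net. With a shared noise estimate, one inversion step followed by one sampling step is exact. Once the noise estimate depends on the current latent, the round trip is only approximate, which is the error accumulation discussed in this section.

```python
import math

def ddim_step(z_t, eps, a_t, a_prev):
    """Deterministic DDIM denoising update z_t -> z_{t-1} for predicted noise eps."""
    coef = math.sqrt(1.0 / a_prev - 1.0) - math.sqrt(1.0 / a_t - 1.0)
    return math.sqrt(a_prev / a_t) * z_t + math.sqrt(a_prev) * coef * eps

def ddim_inv_step(z_prev, eps, a_t, a_prev):
    """DDIM inversion update z_{t-1} -> z_t, reusing the noise predicted at z_{t-1}."""
    coef = math.sqrt(1.0 / a_t - 1.0) - math.sqrt(1.0 / a_prev - 1.0)
    return math.sqrt(a_t / a_prev) * z_prev + math.sqrt(a_t) * coef * eps

def toy_eps(z):
    """Hypothetical state-dependent noise predictor standing in for the U-Net."""
    return 0.1 * z

def round_trip(z0, alphas):
    """Invert z0 up to step T, then sample back down to step 0."""
    z = z0
    for t in range(1, len(alphas)):
        z = ddim_inv_step(z, toy_eps(z), alphas[t], alphas[t - 1])
    for t in range(len(alphas) - 1, 0, -1):
        z = ddim_step(z, toy_eps(z), alphas[t], alphas[t - 1])
    return z

T = 50
# Assumed toy schedule: alpha decreases linearly from 1.0 (clean) to 0.005 (noisy).
alphas = [1.0 - 0.995 * t / T for t in range(T + 1)]
reconstruction_error = abs(round_trip(1.0, alphas) - 1.0)
```

The nonzero `reconstruction_error` comes only from evaluating the predictor at slightly different latents in the two directions; it is a toy analogue of why reconstruction attention can drift from inversion attention.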

3.2 FateZero Video Editing

As shown in Fig. 2, we use the pretrained text-to-image model, i.e., Stable Diffusion, as our base model, which contains a UNet for $T$ -timestep denoising. Instead of straightforwardly exploiting the regular pipeline of latent editing guided by reconstruction attention, we have made several critical modifications for video editing as follows.

Inversion Attention Fusion. Direct editing using the inverted noise results in frame inconsistency, which may be attributed to two factors. First, the invertible property of DDIM discussed in Eq. (2) and Eq. (3) only holds in the limit of small steps. However, the 50 DDIM denoising steps used in practice lead to an accumulation of errors with each subsequent step. Second, using a large classifier-free guidance scale $s_{cfg}\gg 1$ can increase the editing ability in denoising, but the large editing freedom leads to inconsistent neighboring frames. Therefore, previous methods require optimization of the text embedding or other regularization.

While these issues seem trivial in the context of single-frame editing, they can become magnified when working with video, as even minor discrepancies among frames are accentuated along the temporal dimension.

To alleviate these issues, our framework utilizes the attention maps from the inversion steps (Eq. (3)), which are available because the source prompt $p_{src}$ and initial latent $z_{0}$ are provided to the UNet during inversion. Formally, during inversion, we store the intermediate self-attention maps $[s_{t}^{\text{src}}]_{t=1}^{T}$ and cross-attention maps $[c_{t}^{\text{src}}]_{t=1}^{T}$ at each timestep $t$, together with the final latent feature maps $z_{T}$, as

$$z_{T},\,[c_{t}^{\text{src}}]_{t=1}^{T},\,[s_{t}^{\text{src}}]_{t=1}^{T}=\text{DDIM-Inv}(z_{0},p_{src}), \tag{4}$$

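where DDIM-Inv stands for the DDIM inversion pipeline discussed in Eq. (3). During the editing stage, we can obtain the noise to remove by fusing the attention from inversion:

$$\hat{\varepsilon}_{t}=\text{Att-Fusion}(\varepsilon_{\theta},z_{t},t,p_{\text{edit}},c_{t}^{\text{src}},s_{t}^{\text{src}}), \tag{5}$$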

where $p_{\text{edit}}$ represents the modified prompt. In the function Att-Fusion, we inject the cross-attention maps of the unchanged part of the prompt, similar to Prompt-to-Prompt. We also replace the self-attention maps to preserve the original structure and motion during style and attribute editing.
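The store-then-fuse idea can be illustrated with a minimal NumPy sketch for a single attention layer. The class `AttentionStore`, the function `fuse_attention`, and the boolean vector `unchanged_tokens` are hypothetical names introduced here. The sketch also assumes that the source and edited prompts have the same number of tokens and are aligned position by position, which a real implementation would have to establish.

```python
import numpy as np

class AttentionStore:
    """Keeps the self- and cross-attention maps recorded at each inversion timestep."""

    def __init__(self):
        self.self_maps = {}
        self.cross_maps = {}

    def save(self, t, s_map, c_map):
        # Copy so later in-place changes by the caller do not alter the stored maps.
        self.self_maps[t] = np.array(s_map, copy=True)
        self.cross_maps[t] = np.array(c_map, copy=True)

def fuse_attention(store, t, s_edit, c_edit, unchanged_tokens):
    """Return fused maps for denoising step t.

    s_edit: (hw, k) self-attention computed with the edited prompt (replaced entirely here).
    c_edit: (hw, n_tokens) cross-attention computed with the edited prompt.
    unchanged_tokens: (n_tokens,) True where the token is shared by both prompts.
    """
    s_fused = store.self_maps[t]
    # Shared tokens take the inversion-time map; edited tokens keep the new map.
    c_fused = np.where(unchanged_tokens[None, :], store.cross_maps[t], c_edit)
    return s_fused, c_fused

# Toy usage: 4 spatial positions, 2 prompt tokens, the second token edited.
store = AttentionStore()
store.save(10, np.full((4, 4), 0.25), np.array([[0.9, 0.1]] * 4))
s_fused, c_fused = fuse_attention(
    store, 10, np.eye(4), np.array([[0.2, 0.8]] * 4), np.array([True, False])
)
```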

Fig. 3 shows a toy comparison between our attention fusion method and the typical image editing approach of simple inversion followed by generation. The cross-attention map during inversion captures the silhouette and the pose of the cat in the source image, but the map during reconstruction differs noticeably. In video, attention consistency may in turn influence temporal consistency, as shown in Fig. 8. This is because the spatial-temporal self-attention maps represent the correspondence between frames, and the temporal modeling ability of existing video diffusion models is not satisfactory.

Spatial-Temporal Self-Attention. The previous designs make our method a strong editing approach that better preserves structure and shows great potential for video editing. However, denoising each frame individually still produces an inconsistent video. Inspired by causal self-attention and a recent one-shot video generation method, we reshape the original self-attention into Spatial-Temporal Self-Attention without changing the pretrained weights. Specifically, we implement $\textsc{Attention}(Q,K,V)$ for feature $z^{i}$ at temporal index $i\in[1,n]$ as

$$Q=W^{Q}z^{i},\quad K=W^{K}\left[z^{i};\mathbf{z}^{\text{w}}\right],\quad V=W^{V}\left[z^{i};\mathbf{z}^{\text{w}}\right], \tag{6}$$

where $[\cdot]$ denotes the concatenation operation and $W^{Q}$, $W^{K}$, $W^{V}$ are the projection matrices from the pretrained model. Empirically, we find it is enough to warp the middle frame $\mathbf{z}^{\text{w}}=z^{\text{Round}[\frac{n}{2}]}$ for attribute and style editing. Thus, the spatial-temporal self-attention map is represented as $s^{src}_{t}\in R^{hw\times fhw}$, where $f=2$ is the number of frames used as key and value. It captures both the structure of a single frame and the temporal correspondence with the warped frame.
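A minimal NumPy sketch of this attention layout is given below. It assumes tokens are stored as rows, so the projections are applied by right-multiplication, and it picks the middle frame with 0-based index `n // 2`; the exact rounding convention is an assumption. The function name `spatial_temporal_attention` and the random weights are illustrative only. The point of the sketch is the shape of the attention map: each of the $hw$ queries attends over $2hw$ keys.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_temporal_attention(z, W_q, W_k, W_v):
    """Attend from each frame to itself plus one shared reference frame.

    z: (n, hw, c) tokens for n frames; W_q, W_k, W_v: (c, d) projections.
    Returns the output (n, hw, d) and the attention maps (n, hw, 2 * hw).
    """
    n = z.shape[0]
    z_w = z[n // 2]  # reference (middle) frame, 0-based index assumed
    # Keys and values come from the concatenation [z^i; z^w] along the token axis.
    kv_in = np.concatenate([z, np.broadcast_to(z_w, z.shape)], axis=1)
    Q = z @ W_q
    K = kv_in @ W_k
    V = kv_in @ W_v
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1]))
    return attn @ V, attn

# Toy usage: 4 frames, 6 spatial tokens, 8 channels, 5 projected channels.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 5)) for _ in range(3))
out, attn = spatial_temporal_attention(z, W_q, W_k, W_v)
```

Because only the inputs to the key and value projections change, the pretrained projection weights can be reused as they are.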

Overall, the proposed approach provides a new method for zero-shot real-world video editing. We replace the attention maps in the denoising steps with their corresponding maps from the inversion steps. After that, we utilize cross-attention maps as masks to prevent semantic leaks. Finally, we reform the self-attention of the UNet into spatial-temporal attention for better temporal consistency among frames. We have included a formal algorithm in the supplementary materials for reference.

3.3 Shape-Aware Video Editing

Different from appearance editing, reforming the shape of a specific object in the video is much more challenging. To this end, a pretrained video diffusion model is needed. Since there is no publicly available generic video diffusion model, we perform the editing on the one-shot video diffusion model instead. In this case, we compare our editing method with simple DDIM inversion, and our method achieves better performance in terms of editing ability, motion consistency, and temporal consistency. This might be because it is hard for an inflated model to overfit the exact motion of the input video. In our method, by contrast, the motion and structure are represented by high-quality spatial-temporal attention maps $s^{src}_{t}\in R^{hw\times fhw}$ during inversion, which are further fused with the attention maps during editing. More details can be found in Fig. 7 and the supplementary video.

Experiments

For zero-shot style and attribute editing, we directly use the trained Stable Diffusion v1.4 as the base model and fuse the attention maps in the interval $t\in[0.2\times T,T]$ of the DDIM steps, with total timestep $T=50$. For shape editing, we utilize the model pretrained on the specific video for 100 iterations and fuse the attention at DDIM timesteps $t\in[0.5\times T,T]$, giving more freedom for new shape generation. Following previous works, we use videos from DAVIS and other in-the-wild videos to evaluate our approach. The source prompt of each video is generated by an image caption model. Finally, we design the target prompt for each video by replacing or adding several words.

4.2 Applications

Local attribute and global style editing. Using a pretrained text-to-image diffusion model, our framework supports zero-shot local attribute and global style editing, as shown in Fig. 6 and the third row of Fig. 1. In the first row, the texture and color of the feather are modified by the target prompt "Swarovski crystal" and kept consistent across frames. In the second and third rows, our framework applies abstract styles (Ukiyo-e and Makoto Shinkai). The image structure and temporal motion are well preserved since we fuse both the spatial-temporal self-attention and the cross-attention during the inversion and editing stages.

Shape-aware editing. Fig. 5 and the second row of Fig. 1 present the results of difficult object shape editing with a pretrained video model. This task is challenging because a naive full-resolution fusion of the spatial-temporal self-attention maps results in inaccurate shapes and wrong temporal motion, as shown in the ablation (Fig. 9). Thanks to the proposed Attention Blending, we combine the motion of the generated shape from the editing target with the inverted attention from the input video. The Porsche, duck, and flamingo results show that we generate new content with poses and positions similar to those in the input videos.

Zero-shot image editing. In addition, our framework can serve as a zero-shot image editing method such as local attribute editing (Fig. 3) and object shape editing (Fig. 4) by considering an image as a video with a single frame. We provide more results in our supplementary material.

4.3 Baseline Comparisons

Since there are no available zero-shot video editing methods based on diffusion models, we build the following four state-of-the-art baselines for comparison. (1) Tune-A-Video overfits an inflated diffusion model on a single video to generate similar content. (2) The Neural Layered Atlas (NLA) based method is combined with keyframe editing via state-of-the-art image editing methods. (3) Frame-wise Null-text optimization followed by editing with prompt2prompt. (4) Frame-wise zero-shot editing using SDEdit. For attention-based editing (2, 3, 4), we use the same timestep fusion parameters as ours.

We conduct the quantitative evaluation using the trained CLIP model, as in previous methods. Specifically, 'Tem-Con' measures the temporal consistency of frames by computing the cosine similarity between all pairs of consecutive frames. 'Frame-Acc' is the frame-wise editing accuracy, which is the percentage of frames where the edited image has a higher CLIP similarity to the target prompt than to the source prompt. In addition, three user-study metrics (denoted as 'Edit', 'Image', and 'Temp') measure the editing quality, overall frame-wise image fidelity, and temporal consistency of the video, respectively. We ask 20 subjects to rank different methods with 9 sets of comparisons in each study. As shown in Tab. 1, the proposed zero-shot method achieves the best temporal consistency among the baselines and shows frame-wise editing accuracy comparable to the per-frame optimization method. In the user studies, our method achieves the best average ranking in all three aspects.
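The two CLIP-based metrics can be sketched as follows, operating on embeddings that have already been computed. This is a minimal sketch under stated assumptions: the function names `tem_con` and `frame_acc` are illustrative, the inputs are assumed to be CLIP image embeddings of the edited frames and CLIP text embeddings of the two prompts, and the consecutive-pair similarities are averaged, which the text does not state explicitly.

```python
import numpy as np

def _normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def tem_con(frame_emb):
    """Mean cosine similarity between embeddings of consecutive frames.

    frame_emb: (n_frames, d) image embeddings of the edited video.
    """
    f = _normalize(frame_emb)
    return float(np.mean(np.sum(f[:-1] * f[1:], axis=-1)))

def frame_acc(frame_emb, src_text_emb, tgt_text_emb):
    """Fraction of frames closer to the target prompt than to the source prompt."""
    f = _normalize(frame_emb)
    sim_src = f @ _normalize(src_text_emb)
    sim_tgt = f @ _normalize(tgt_text_emb)
    return float(np.mean(sim_tgt > sim_src))

# Toy usage with 2-D stand-ins for real CLIP embeddings.
frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
consistency = tem_con(frames)
accuracy = frame_acc(frames, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```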

For a qualitative comparison, Fig. 7 shows the results of our method and the baselines at two different frames. The editing result of frame-wise SDEdit cannot be localized and varies considerably among frames. Frame-wise Null inversion achieves local editing at the cost of a 500-iteration optimization for each frame but is still temporally inconsistent. The NLA-based method preserves the exact pixels in the atlas. However, it struggles to perform editing that involves new shapes or 3D structures. In addition, it takes hours to optimize the neural atlas for each input video. While Tune-A-Video with DDIM ranks second in editing quality and image fidelity in Tab. 1, we observe that it has difficulty reproducing the exact motion and spatial position of the input video (right side of Fig. 7). Besides, the background has noticeable artifacts. Different from the above baselines, our method preserves the motion by fusing the attention during inversion and editing. Thus, our results outperform the others by a large margin in our user study and in frame consistency measured by CLIP.

4.4 Ablation Studies

Although we have demonstrated the effectiveness of the proposed strategies in Fig. 4 and Fig. 3 using toy image examples, here we ablate these designs on video.

Attention during inversion. In the right column of Fig. 8, we use the attention map from reconstruction instead of inversion for zero-shot background editing. The visualized cross-attention map of the word 'boat' in the first and last frames cannot capture the correct position and structure of the boat, which may be caused by the poor temporal modeling capacity of the image diffusion model and the accumulation of errors in DDIM inversion. In contrast, we propose using the attention from inversion, as in the middle column, which provides stable guidance on the semantic layout of the original video. We observe that this large difference in attention maps between inversion and reconstruction exists in most videos.

Attention Blending Block is studied in Fig. 9, where we either remove all self-attention fusion or fuse all self-attention without a spatial mask. The third column shows that removing all self-attention maps causes a loss of fine details (e.g., fences, poles, and trees in the background) and inconsistency of the car identity over time. In contrast, if we fuse full-resolution self-attention as in previous work, the shape editing ability of the framework is severely degraded, so that the geometry of the generated car resembles the input video, especially in the last few frames. Therefore, we propose to blend the self-attention maps with a mask obtained from cross-attention to preserve unedited details and ensure temporal consistency while editing the object shape.
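The masked blending can be sketched in a few lines of NumPy. This is a minimal sketch under several assumptions: the cross-attention map of the edited word is scaled to $[0,1]$, the binary mask is applied per query position (one row of the self-attention map), and positions inside the mask take the self-attention from editing while the rest take the self-attention from inversion. The function name `blend_self_attention` is illustrative.

```python
import numpy as np

def blend_self_attention(s_src, s_edit, c_src_word, tau):
    """Blend inversion and editing self-attention with a thresholded cross-attention mask.

    s_src, s_edit: (hw, k) self-attention maps from inversion and from editing.
    c_src_word: (hw,) source cross-attention of the edited word, assumed in [0, 1].
    tau: threshold; a smaller tau marks more positions as editable.
    """
    mask = (c_src_word > tau).astype(s_src.dtype)[:, None]  # (hw, 1), broadcast over keys
    return mask * s_edit + (1.0 - mask) * s_src

# Toy usage: 3 query positions, 4 keys.
s_src = np.zeros((3, 4))
s_edit = np.ones((3, 4))
c_word = np.array([0.1, 0.5, 0.9])
blended_shape = blend_self_attention(s_src, s_edit, c_word, tau=0.3)
blended_style = blend_self_attention(s_src, s_edit, c_word, tau=1.0)
```

With `tau=1.0` the mask is empty and the inversion maps are used everywhere, which corresponds to fusing all self-attention; lowering `tau` hands more of the object region over to the editing branch.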

Conclusion

In this paper, we propose FateZero, a new text-driven video editing framework that performs temporally consistent zero-shot editing of attribute, style, and shape. We make the first attempt to study and utilize the cross-attention and spatial-temporal self-attention during DDIM inversion, which provides fine-grained motion and structure guidance at each denoising step. A new Attention Blending Block is further proposed to enhance the shape editing performance of our framework. Our framework enables video editing with widely available image diffusion models, which we believe will contribute to many new video applications.

Limitation & Future Work. While our method achieves impressive results, it still has some limitations. During shape editing, since the motion is produced by the one-shot video diffusion model, it is difficult to generate totally new motion (e.g., 'swim' $\rightarrow$ 'fly') or a very different shape (e.g., 'swan' $\rightarrow$ 'pterosaur'). We will test our method on a generic pretrained video diffusion model for better editing abilities.

Acknowledgement. This project is supported by the National Key R&D Program of China under grant number 2022ZD0161501. The authors would like to express sincere gratitude to Tencent AI Lab for providing the necessary computation resources and a conducive environment for research. Additionally, the authors extend their appreciation to Xilin Zhang for reviewing and revising the writing, and to all friends at Tencent and HKUST who participated in the user study.

References

Appendix A Implementation Details

Pseudo algorithm code. Our full algorithm is shown in Algorithm 1 and Algorithm 2. Algorithm 1 presents the overall framework of our inversion and editing, as visualized on the left of Fig. 1 in the main paper. Algorithm 2 shows that the cross-attention is fused based on a mask of the edited words, and the self-attention is blended using a binary mask obtained by thresholding the cross-attention (the right of Fig. 1 in the main paper).

Hyperparameters Tuning. There are mainly three hyperparameters in our proposed designs:

- ${t}_{s}\in[1,T]$: Last timestep of the self-attention blending. Smaller ${t}_{s}$ fuses more self-attention from inversion to preserve structure and motion.
- ${t}_{c}\in[1,T]$: Last timestep of the cross-attention fusion. Smaller ${t}_{c}$ fuses more cross-attention from inversion to preserve the spatial semantic layout.
- $\tau$: Threshold for the blending mask used in shape editing. Smaller $\tau$ uses more of the self-attention map from editing to improve shape editing results.

In style and attribute editing, we set ${t}_{s}=0.2T$ , ${t}_{c}=0.3T$ , $\tau=1.0$ to preserve most structure and motion in the source video. In shape editing, we set ${t}_{s}=0.5T$ , ${t}_{c}=0.5T$ , $\tau=0.3$ to give more freedom in new motion and 3D shape generation.
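The two settings above can be written down as a small configuration, together with a helper that decides which maps are fused at a given denoising step. This is a sketch, not the released implementation: the names `FusionConfig`, `STYLE`, `SHAPE`, and `fusion_flags` are hypothetical, and treating the fusion interval as the closed range $[t_{s},T]$ (and likewise for $t_{c}$) follows the interval notation used in the experiments section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FusionConfig:
    t_s_ratio: float  # self-attention is fused for t >= t_s_ratio * T
    t_c_ratio: float  # cross-attention is fused for t >= t_c_ratio * T
    tau: float        # threshold of the blending mask

# Values reported for the two editing modes.
STYLE = FusionConfig(t_s_ratio=0.2, t_c_ratio=0.3, tau=1.0)
SHAPE = FusionConfig(t_s_ratio=0.5, t_c_ratio=0.5, tau=0.3)

def fusion_flags(t, T, cfg):
    """Say whether self- and cross-attention from inversion are fused at timestep t."""
    t_s = round(cfg.t_s_ratio * T)  # round to an integer step to avoid float edge cases
    t_c = round(cfg.t_c_ratio * T)
    return {"self": t >= t_s, "cross": t >= t_c}

# Denoising runs from t = T down to t = 1; count how many steps fuse each map.
T = 50
style_self_steps = sum(fusion_flags(t, T, STYLE)["self"] for t in range(T, 0, -1))
shape_self_steps = sum(fusion_flags(t, T, SHAPE)["self"] for t in range(T, 0, -1))
```

Under these assumptions, style editing fuses self-attention over more of the trajectory than shape editing, which leaves the later, less noisy steps of shape editing free to form new geometry.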

Appendix B Demo Video

We provide a detailed demo video to show:

Video Results on style, local attribute, and shape editing to validate the effectiveness of the proposed method.

Method Animation to provide a better understanding of the proposed method.

Baseline Comparisons with previous methods in video.

More Promising Applications. We have shown the effectiveness of the proposed method in the main paper for style, attribute, and shape editing. In the demo video, we also show some potential applications of the proposed method, including (1) object removal, by removing the word for the target object from the source prompt and masking the self-attention of the corresponding area using its cross-attention, and (2) video enhancement, by adding specific words (e.g., 'high-quality', '8K') to the target editing prompt.

Appendix C Limitation and Future Work

Our zero-shot editing is not good at new concept composition or at generating very different shapes. For example, the result of editing 'black swan' to 'yellow pterosaur' in Fig. 10 is unsatisfactory. This problem may be alleviated by using a stronger video diffusion model, which we leave to future work.