Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, Jiaya Jia

Introduction

Video creation and editing are key tasks. Text-driven editing has become a promising pipeline. Several methods have demonstrated the ability to edit generated or real-world images with text prompts. However, it remains challenging to edit only local objects in a video, such as changing a running “dog” into a “cat” without influencing the environment. This paper proposes a pipeline that can edit a video both locally and globally, as shown in Figs. 1 and 5.

Text-driven image editing requires a model capable of generating target content, such as changing the category or property of an object. Diffusion models have demonstrated outstanding generation capabilities in this area. Among these methods, attention control emerges as the most effective pipeline for detailed image editing. In order to edit a real image, this pipeline includes two necessary steps: (1) inverting the image into latent features with a pre-trained diffusion model, and (2) controlling attention maps in the denoising process to edit the corresponding parts of the image. For example, by swapping their attention maps, we can replace a “child” with a “panda”.

In this paper, we aim to build an attention control-based pipeline for video editing. Since no large-scale pre-trained video generation models are publicly available, we propose a novel framework to show that a pre-trained image diffusion model can be adapted for detailed video editing.

While a pre-trained image diffusion model can be utilized for video editing by processing frames individually (Image-P2P), it lacks semantic consistency across frames (the 2nd row of Fig. 2). To maintain semantic consistency, we propose performing inversion and attention control jointly over all frames, by transforming the text-to-image diffusion model (T2I) into a text-to-set model (T2S). This approach is effective, as illustrated in the 3rd row of Fig. 2, where the robotic penguin maintains its consistency across frames.

Following prior work, we convert a T2I model into a T2S model by altering the convolution kernels and replacing the self-attentions with frame-attentions. This conversion yields a model that generates a set of semantically consistent images. Inflation degrades the generation quality, but the quality can be recovered by tuning on the original video. Although the tuned T2S model is not an ideal video generation model, it suffices to create an approximate inversion for a video, as shown in Fig. 3 (c). It is only an approximation because errors accumulate in the denoising pass, consistent with conclusions in prior work.

To improve the inversion quality, we propose to optimize a shared unconditional embedding for all frames to align the denoising latent features with the diffusion latent features. Our experiments show that shared embedding is the most efficient and effective choice for video inversion. Comparisons are shown in Fig. 3.

As discussed in prior work, successful attention control requires a model to have both reconstruction ability and editability. While image inversion has been argued to possess both abilities, we find that video editing presents different challenges. The T2S model, as an inflated model not trained on any videos, is not robust to the perturbations caused by various unconditional embeddings. Although our optimized embedding can achieve reconstruction, changing prompts can destabilize the model and result in low-quality generation. On the other hand, we find that the approximate inversion with an initialized unconditional embedding is editable but cannot reconstruct well. To address this issue, we propose a decoupled-guidance strategy in attention control, utilizing different guidance strategies for the source and target prompts. Specifically, we use the optimized unconditional embedding for the source prompt and the initialized unconditional embedding for the target prompt. We incorporate the attention maps from these two branches to generate the target video. These two simple designs prove effective and successfully complete video editing. Our contributions can be summarized as:

We propose the first framework for video editing with attention control. A decoupled-guidance strategy is designed to further improve performance.

We introduce an efficient and effective video inversion method with shared unconditional embedding optimization to improve video editing substantially.

We conduct extensive ablation studies and comparisons to show the effectiveness of our video editing framework.

Related Work

DALL-E first considers the text-to-image (T2I) generation task as a sequence-to-sequence translation problem, with subsequent research improving generation quality. Denoising Diffusion Probabilistic Models (DDPMs) have recently gained popularity for T2I. GLIDE utilizes classifier-free guidance to improve text conditioning. DALLE-2 leverages CLIP for better text-image alignment. Latent Diffusion Models (LDMs) propose processing in the latent space to enhance training efficiency. In our work, we employ a pre-trained image diffusion model based on LDMs.

Text-to-video (T2V) generation is a nascent research area. GODIVA first introduces VQ-VAE to T2V. CogVideo combines T2V with CogView-2, utilizing pre-trained text-to-image models. Video Diffusion Models (VDM) propose a space-time U-Net for performing diffusion on pixels. Imagen Video successfully generates high-quality videos with cascaded diffusion models and v-prediction parameterization. Phenaki generates videos with time-variable prompts. Make-A-Video combines the appearance generation of T2I models with movement information from video data. While these approaches generate reasonable short videos, they still contain artifacts and do not support real-world video editing. Additionally, most of these approaches are not publicly available at this time.

Several single-video generative models have been proposed. Single-video GANs can generate novel videos with similar objects and motions, while SinFusion uses diffusion models to improve generalization but is limited to simple cases. Tune-A-Video inflates an image diffusion model into a video model and tunes it to reconstruct the input video. It allows for changes in semantic content but with limited temporal consistency. We find that using DDIM inversion results can improve its temporal consistency. However, it cannot avoid altering unrelated regions. We adapt some designs of TAV to do our model initialization.

Text Driven Editing

Generative models have demonstrated impressive performance in image editing, with approaches ranging from GANs to diffusion models. SDEdit adds noise to an input image and uses the diffusion process to recover an edited version. Prompt-to-Prompt and Plug-and-Play use attention control to minimize changes to unrelated parts, while Null-Text Inversion improves real image editing. InstructPix2Pix enables flexible text-driven editing with user-provided instructions. Textual Inversion, DreamBooth, and Custom-Diffusion learn special tokens for personalized concepts and generate related images.

Video editing with generative models has seen several advances recently. Text2Live employs CLIP to edit textures in videos but struggles with significant semantic changes. Dreamix uses a pre-trained Imagen Video backbone to perform image-to-video and video-to-video editing, with the ability to change motion as well. Gen-1 trains models jointly on images and videos for tasks such as stylization and customization. While these methods enable modifying video content, they operate like guided generation and tend to modify all regions together when editing an object. Our proposed method allows for local editing with a diffusion model pre-trained on images.

Method

Let $\mathcal{V}$ be a real video containing $n$ frames. We adopt the Prompt-to-Prompt setting by introducing a source prompt $\mathcal{P}$ and an edited prompt $\mathcal{P}^{*}$ which together generate an edited video $\mathcal{V}^{*}$ containing $n$ frames. The prompts are provided by the user.

To achieve cross-attention control in video editing, we propose Video-P2P, a framework with two key technical designs: (1) optimizing a shared unconditional embedding for video inversion, and (2) using different guidance for the source and edited prompts, and incorporating their attention maps. The framework is illustrated in Fig. 4.

LDMs generate an image latent $z_{0}$ using a random noise vector $z_{T}$ and a textual condition $\mathcal{P}$ as inputs. As variants of DDPMs, these models aim to predict artificial noise by minimizing the following objective:

where $\mathcal{C}=\psi(\mathcal{P})$ is the embedding of the text prompt, and noise $\varepsilon$ is added to $z_{0}$ according to step $t$ to obtain $z_{t}$. During inference, the model predicts noise $\varepsilon_{\theta}(\cdot)$ for $T$ steps to generate an image from $z_{T}$.
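For reference, in the notation of this section the standard LDM noise-prediction objective is commonly written as below. This is the usual form from the latent diffusion literature, stated here as an assumption about the intended equation rather than a transcription of it:

$$\min_{\theta}\;\mathbb{E}_{z_{0},\,\varepsilon\sim\mathcal{N}(0,I),\,t\sim\mathrm{Uniform}(1,T)}\left\|\varepsilon-\varepsilon_{\theta}(z_{t},t,\mathcal{C})\right\|_{2}^{2}$$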

DDIM sampling and inversion.

Deterministic DDIM sampling can be used to generate an image from latent features in a small number of denoising steps:

We use an encoder to encode the real image before the diffusion process and a decoder to decode after the denoising process. DDIM sampling can be reversed in a few steps through the equation:

known as DDIM inversion. This can be used to obtain the corresponding latent features of a real image.
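As a minimal sketch of the two operations described here, the snippet below implements the standard deterministic DDIM update with NumPy. The function name `ddim_move`, the noise-schedule values, and the random array standing in for the network prediction $\varepsilon_{\theta}$ are all assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

def ddim_move(z, eps, alpha_from, alpha_to):
    """Move a latent between two noise levels along the deterministic DDIM path.

    alpha_from / alpha_to are cumulative signal coefficients (alpha-bar) in (0, 1].
    Sampling moves toward a larger alpha (less noise); inversion moves toward
    a smaller alpha (more noise). The same formula serves both directions.
    """
    # Estimate of the clean latent z_0 implied by the current latent and noise.
    z0_pred = (z - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise that estimate to the target noise level.
    return np.sqrt(alpha_to) * z0_pred + np.sqrt(1.0 - alpha_to) * eps

rng = np.random.default_rng(0)
z_t = rng.normal(size=(4, 8, 8))
eps = rng.normal(size=(4, 8, 8))      # stand-in for eps_theta(z_t, t, C)
alpha_t, alpha_prev = 0.5, 0.8        # illustrative schedule values

z_prev = ddim_move(z_t, eps, alpha_t, alpha_prev)      # one sampling step
z_back = ddim_move(z_prev, eps, alpha_prev, alpha_t)   # one inversion step
roundtrip_error = float(np.max(np.abs(z_back - z_t)))
```

The round trip is exact here only because the same `eps` is reused in both directions. In practice the network is re-evaluated at the new latent, so the predicted noise differs slightly and the inversion is approximate.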

Null-text inversion.

To amplify the effect of text conditioning during image generation, classifier-free guidance was proposed, which combines the conditional prediction with an unconditional prediction:

where $\varnothing=\psi("")$ is the embedding of a null text and $w$ is the guidance weight. However, classifier-free guidance increases the errors accumulated in the denoising process, leading to imperfect image reconstruction with DDIM inversion. Null-text inversion proposes to align the diffusion latent trajectory $z_{T}^{*},\ldots,z_{0}^{*}$ with the denoising latent trajectory $z_{T},\ldots,z_{0}$ by optimizing a step-wise unconditional embedding $\varnothing_{t}$:
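In the notation above, the guidance rule and the step-wise optimization described in this paragraph are commonly written as follows. Both are the standard forms from the classifier-free guidance and null-text inversion literature, given as an assumption about the intended equations. The symbol $\bar{z}_{t}$, denoting the latent carried through the optimized denoising pass, is introduced here for illustration, and $z_{t-1}(\bar{z}_{t},\varnothing_{t},\mathcal{C})$ denotes one guided DDIM sampling step starting from $\bar{z}_{t}$:

$$\tilde{\varepsilon}_{\theta}(z_{t},t,\mathcal{C},\varnothing)=w\cdot\varepsilon_{\theta}(z_{t},t,\mathcal{C})+(1-w)\cdot\varepsilon_{\theta}(z_{t},t,\varnothing)$$

$$\min_{\varnothing_{t}}\left\|z_{t-1}^{*}-z_{t-1}(\bar{z}_{t},\varnothing_{t},\mathcal{C})\right\|_{2}^{2}$$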

Video Inversion

We begin by constructing a T2S model that is capable of performing an approximate inversion. Following the VDM baselines and TAV, we employ $1\times 3\times 3$ pattern convolution kernels and temporal attention. Moreover, we replace the self-attentions with frame-attentions, which take the first frame $v_{0}$ and the current frame $v_{i}$ as inputs and update features for the frame $v_{i}$. The formulation of the frame-attention is as follows:

where $W$ are the projection matrices in attention. The model processes a video pair-by-pair and computes $n$ times to obtain the prediction for every frame. While the Sparse-causal attention proposed in TAV outperforms frame-attention when generating videos from random noise, we find that the simple design suffices for video inversion since the reversed latent features can capture temporal information. Additionally, frame-attention conserves memory and speeds up the process.
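The pairwise computation described above can be sketched as follows. This is a toy NumPy version under one stated assumption: queries are projected from the current frame $v_{i}$, while keys and values are projected from the first frame $v_{0}$. The function names, tensor sizes, and random weights are hypothetical and only illustrate the data flow.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_attention(v0, vi, W_q, W_k, W_v):
    """Update features of frame v_i by attending to the first frame v_0.

    v0, vi: (tokens, dim) features of the first and the current frame.
    Assumption: queries come from v_i, keys and values from v_0.
    """
    Q = vi @ W_q
    K = v0 @ W_k
    V = v0 @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_frames, tokens, dim = 4, 16, 8
video = rng.normal(size=(n_frames, tokens, dim))
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# Process the video pair-by-pair: (v_0, v_i) for every frame i, n times in total.
out = np.stack([frame_attention(video[0], video[i], W_q, W_k, W_v)
                for i in range(n_frames)])
```

Because every frame attends to a single reference frame, the attention matrix per frame has size tokens × tokens rather than growing with the number of frames, which is consistent with the memory and speed argument made above.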

While model inflation can aid in preserving semantic consistency across frames, it adversely impacts the generation quality of the T2I model. This is because the self-attention parameters are utilized to compute frame correlations, for which they have not been pre-trained. Consequently, the T2S model, generated through inflation, is insufficient for the approximate inversion, as demonstrated in Fig. 2. To address this, we fine-tune the query projection matrices $W^{Q}$ of the frame- and cross-attentions, as well as the additional temporal attention, to perform noise prediction based on the input video, following TAV. After this initialization, the T2S model is capable of generating semantically consistent image sets while maintaining the quality of each frame, resulting in successful approximate inversion.

Using the fine-tuned T2S model, we perform video inversion by optimizing a shared unconditional embedding. During inversion, each latent feature $z_{t}$ contains a channel for the frames with dimension $n$, where $z_{t,i}$ denotes the latent feature for the $i$-th frame. We employ DDIM inversion to generate latent features $z_{0}^{*},\ldots,z_{T}^{*}$. The unconditional embedding is defined as follows:

is updated at each step. The T2S model’s frame-attentions use two latent features to calculate the corresponding feature for the next step. Notice $\varnothing_{t}$ is shared by all frames ( $i=1,\ldots,n$ ) which minimizes the memory usage. Besides, using the same unconditional embedding for all frames avoids destabilizing the semantic consistency in attention control.
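One way to write the shared-embedding objective, by analogy with the step-wise null-text optimization described earlier, is sketched below. This is a hedged reconstruction from the surrounding description, not the authors' exact equation. The symbol $\bar{z}_{t,i}$, the current denoising latent of frame $i$, is notation introduced here, and the dependence on the first-frame latent $\bar{z}_{t,1}$ reflects the frame-attention pairing of the first and current frames:

$$\min_{\varnothing_{t}}\sum_{i=1}^{n}\left\|z_{t-1,i}^{*}-z_{t-1,i}(\bar{z}_{t,i},\bar{z}_{t,1},\varnothing_{t},\mathcal{C})\right\|_{2}^{2}$$

A single $\varnothing_{t}$ appears in every term of the sum, so the number of optimized parameters per step does not grow with $n$.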

Decoupled-guidance Attention Control

To perform attention control on real images, existing works require an inference pipeline with both reconstruction ability and editability. However, achieving such a pipeline for a T2S model is challenging. Video inversion allows us to establish an inference pipeline to reconstruct the original video well. However, the T2S model is not as robust as T2I models due to a lack of pre-training with videos. As a result, its editability is compromised with the optimized unconditional embedding, leading to degraded generation quality when changing prompts. In contrast, we find that using an initialized unconditional embedding makes the model more editable while it cannot reconstruct perfectly. This inspires us to combine the abilities of two inference pipelines. For the source prompt, we use the optimized unconditional embedding in the classifier-free guidance. For the target prompt, we choose the initialized unconditional embedding. We then incorporate attention maps from these two branches to obtain the edited video, where the unchanged parts are influenced by the source branch and the edited parts are influenced by the target branch.
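The two-branch guidance described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: `toy_eps` is a hypothetical stand-in for the tuned T2S noise predictor, the embedding size of 77 and the guidance weight of 7.5 are assumed values, and the exchange of attention maps between branches is omitted.

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    return w * eps_cond + (1.0 - w) * eps_uncond

def toy_eps(z, emb):
    """Hypothetical stand-in for the tuned T2S noise predictor (not a real model)."""
    return 0.1 * z + emb.mean()

def decoupled_guidance_step(z_src, z_tgt, c_src, c_tgt, null_opt_t, null_init, w=7.5):
    """Guided noise for both branches at one denoising step.

    Source branch: optimized unconditional embedding (favors reconstruction).
    Target branch: initialized unconditional embedding (favors editability).
    """
    eps_src = cfg(toy_eps(z_src, c_src), toy_eps(z_src, null_opt_t), w)
    eps_tgt = cfg(toy_eps(z_tgt, c_tgt), toy_eps(z_tgt, null_init), w)
    return eps_src, eps_tgt

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4, 16, 16))                   # (frames, channels, h, w)
c_src, c_tgt = rng.normal(size=77), rng.normal(size=77)
null_init = np.zeros(77)                              # initialized null embedding
null_opt_t = null_init + 0.05 * rng.normal(size=77)   # pretend optimization result

eps_src, eps_tgt = decoupled_guidance_step(z, z, c_src, c_tgt, null_opt_t, null_init)
```

The only difference between the branches, apart from the prompt embedding, is which unconditional embedding enters the guidance formula. That is the decoupling the paragraph describes.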

The pseudo algorithm is shown in Alg. 1. We adopt the attention control methods from Image-P2P to Video-P2P. For example, to perform word swap, the $Edit$ function can be represented as:

$M_{t}$ and $M_{t}^{*}$ are the cross-attention maps for every frame at every step, and $DM$ is the tuned T2S model. Changing the frame-attention maps has a small influence on the final results. Attention maps are swapped only for the first $\tau$ steps because attentions are formed in the early period. $\overline{M}_{t,w}$ is the average attention map of the word $w$ calculated at step $t$. It is averaged over steps $T,\ldots,t$ independently for every frame. For the $j$-th frame, we calculate:

$B\left(\overline{M}_{t,w}\right)$ represents the binary mask obtained from the attention map. A value is set to 1 when larger than a threshold.
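A minimal sketch of the word-swap rule and the thresholded mask described above is given below. It assumes the Prompt-to-Prompt convention that the source maps are injected during the first $\tau$ denoising steps and the target maps are used afterwards. The function names, array layout, and per-frame max normalization before thresholding are illustrative assumptions.

```python
import numpy as np

def word_swap_edit(M_src, M_tgt, step, tau):
    """Return the source maps for the first `tau` denoising steps, else the target maps.

    `step` counts denoising steps from 0 (the noisiest step) upward.
    """
    return M_src if step < tau else M_tgt

def averaged_word_mask(maps_so_far, word_idx, threshold=0.3):
    """Average one word's attention over the steps seen so far, then binarize.

    maps_so_far: (steps, frames, pixels, words) cross-attention maps.
    Returns a (frames, pixels) mask; averaging is independent for every frame.
    """
    avg = maps_so_far[..., word_idx].mean(axis=0)     # (frames, pixels)
    avg = avg / avg.max(axis=-1, keepdims=True)       # per-frame normalization (assumption)
    return (avg > threshold).astype(np.float32)       # 1 where above the threshold

rng = np.random.default_rng(0)
steps, frames, pixels, words = 5, 3, 16, 4
maps = rng.random(size=(steps, frames, pixels, words))    # toy source-branch maps
M_src, M_tgt = maps[-1], rng.random(size=(frames, pixels, words))
tau = 2

used_early = word_swap_edit(M_src, M_tgt, step=0, tau=tau)   # source maps injected
used_late = word_swap_edit(M_src, M_tgt, step=3, tau=tau)    # target maps kept
mask = averaged_word_mask(maps, word_idx=2)
```

The default threshold of 0.3 matches the attention threshold reported in the experimental settings.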

Experiments

We develop our method based on CompVis Stable Diffusion (v1-5). Similar to TAV, we fix the image autoencoder and sample 8 or 24 frames at the resolution of 512 $\times$ 512 from a video. To initialize the model, we fine-tune the T2S model for 500 steps to reconstruct the original video. During attention control, we set the cross-attention replacing ratio to 0.4 and the attention threshold to 0.3. For prompt refinement, we set the refinement ratio to 0.4. These parameters can be adjusted to control the editing fidelity for different examples. All 8-frame experiments are conducted on a single V100 GPU, with 5 minutes for initialization (tuning), 6 minutes for inversion, and 1 minute for inference.
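The settings listed above could be collected in a small configuration object, sketched below. The class and field names are hypothetical and do not come from the authors' code; only the default values are taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VideoP2PConfig:
    """Hypothetical container for the reported settings; field names are illustrative."""
    num_frames: int = 8                # the text reports 8 or 24 frames
    resolution: int = 512              # frames are sampled at 512 x 512
    tuning_steps: int = 500            # T2S fine-tuning steps for initialization
    cross_replace_ratio: float = 0.4   # cross-attention replacing ratio
    attention_threshold: float = 0.3   # threshold for the binary attention mask
    refinement_ratio: float = 0.4      # used for prompt refinement

    def __post_init__(self):
        # All three ratios are fractions, so they must lie in [0, 1].
        for name in ("cross_replace_ratio", "attention_threshold", "refinement_ratio"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")

default_cfg = VideoP2PConfig()
long_cfg = VideoP2PConfig(num_frames=24)
```

Keeping the ratios in one place makes it easy to adjust editing fidelity per example, as the paragraph suggests.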

Applications

Our Video-P2P method can be utilized for a range of editing applications, including prompt refinement, attention re-weighting, and word swapping, similar to the capabilities of Image-P2P. Video-P2P is able to maintain semantic consistency across different frames and preserve the temporal coherence of the original video during the editing process. More examples can be found in the appendix.

Video-P2P enables the replacement of entities based on word swapping while maintaining the coherence of unrelated regions. As illustrated in Fig. 5, Video-P2P seamlessly replaces the man on the motorbike with Spider-Man while minimizing the changes to the motorbike’s appearance (the 4th row). The generated Spider-Man exhibits a consistent appearance across frames, and the background remains unchanged. Furthermore, we can replace a dog with a cat while preserving its gesture and the surrounding grass (the 5th row).

Prompt refinement.

Video-P2P can perform prompt refinement, such as modifying object properties. For example, we can transform the running dog into a robotic one (the 6th row in Fig. 5), and convert a motorbike into a Lego toy with the same motion (the 3rd row). Note that the grass and sky are largely unaffected. Additionally, Video-P2P can perform global editing, like changing the weather to sunset or flooding the road with water (2nd row). Style transfer can also be accomplished by Video-P2P, as exemplified by transforming the video into a watercolor painting.

Attention re-weighting.

Similar to Image-P2P, Video-P2P also enables attention re-weighting. By adjusting the cross-attention of specific words, we can manipulate the extent of the corresponding generation. For instance, we can regulate how fluffy a dog is in the video (the 6th row of Fig. 5).
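As a minimal sketch of re-weighting, the snippet below scales the cross-attention column of a single word and leaves all other words untouched, following the Prompt-to-Prompt style of re-weighting. The function name, array layout, and scale values are illustrative assumptions.

```python
import numpy as np

def reweight_attention(M, word_idx, scale):
    """Scale the cross-attention of one word; other words are left unchanged.

    M: (frames, pixels, words) cross-attention maps at one denoising step.
    Returns a new array, so the input maps are not modified.
    """
    out = M.copy()
    out[..., word_idx] *= scale
    return out

rng = np.random.default_rng(0)
M = rng.random(size=(3, 16, 5))
M_stronger = reweight_attention(M, word_idx=2, scale=2.0)   # e.g. a fluffier dog
M_weaker = reweight_attention(M, word_idx=2, scale=0.5)     # e.g. a less fluffy dog
```

Applying the same scale to every frame keeps the strength of the edit consistent across the video.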

Comparison

Both TAV+DDIM and our Video-P2P allow for video editing with text prompts. However, TAV+DDIM cannot avoid altering the entire video content when editing specific objects, while Video-P2P can edit a local area and minimize the influence on other regions. Fig. 6 (Left) demonstrates that Video-P2P preserves the complex shape of the cloud when replacing a lion with King Kong, whereas TAV+DDIM can only maintain the color tone of the sky in this case.

Although our model initialization is similar to TAV, Video-P2P can still generate temporally consistent results where TAV+DDIM fails. As demonstrated in Fig. 6 (Right), TAV struggles to generate a temporally consistent sequence in the second row, even when the inputs are features from DDIM inversion. In contrast, our method can produce better structure-preserved results, as shown in the third row.

Comparison with Dreamix.

In contrast to Dreamix, which uses a pre-trained video diffusion model that is not publicly available, our method yields superior results for subject replacement. Although our method cannot perform video motion editing due to the lack of temporal priors, we outperform Dreamix in preserving details and motion consistency. As Dreamix is not open-sourced, we conducted our evaluation on its released demo. As demonstrated in Fig. 7, both methods can transform two dogs into two cats, but our method preserves the details of the drawer in the background (the 3rd row). Furthermore, Dreamix may affect the time sequence to some extent, as the generated cat moves more slowly than the original dog in the video. In contrast, our method completely preserves the motion of the original video.

Quantitative results.

We evaluate our proposed Video-P2P on 10 YouTube videos and report four metrics for quantitative analysis. The CLIP Score measures the similarity between the text prompt and the video, while Masked PSNR and LPIPS evaluate the quality of structure preservation. We also propose a novel metric, Object Semantic Variance (OSV), to measure semantic consistency across frames. For detailed explanations of these metrics, please refer to the appendix. Our results, as shown in Table 1, demonstrate that Video-P2P performs well on all metrics. Compared to TAV+DDIM, Video-P2P achieves higher Masked PSNR and lower LPIPS, indicating better preservation of unchanged regions. Compared to the other two methods, Video-P2P has a much lower OSV, indicating its superior ability to maintain semantic consistency across frames. Moreover, in Tab. 3, we report the user study results, where Video-P2P ranks first on average and has a high preference rate compared to other methods.
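The exact definition of Masked PSNR is deferred to the appendix. As a hedged illustration of the idea, the sketch below assumes it is an ordinary PSNR restricted to pixels outside the edited object's mask, so that it measures how well unchanged regions are preserved. The function name, array layout, and toy data are assumptions.

```python
import numpy as np

def masked_psnr(original, edited, edit_mask, max_val=1.0):
    """PSNR computed only over pixels outside the edited region.

    original, edited: (frames, H, W, C) arrays with values in [0, max_val].
    edit_mask: (frames, H, W) boolean array, True where the object was edited.
    """
    keep = ~edit_mask                        # pixels expected to stay unchanged
    diff = (original - edited)[keep]         # shape (num_kept_pixels, C)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")                  # unchanged regions reproduced exactly
    return float(10.0 * np.log10(max_val ** 2 / mse))

rng = np.random.default_rng(0)
original = 0.8 * rng.random(size=(2, 8, 8, 3))
edit_mask = np.zeros((2, 8, 8), dtype=bool)
edit_mask[:, 2:6, 2:6] = True                # region occupied by the edited object

local_edit = original.copy()
local_edit[edit_mask] = 1.0                  # changes confined to the mask
global_edit = original + 0.1                 # uniform offset everywhere

psnr_local = masked_psnr(original, local_edit, edit_mask)    # infinite: outside is intact
psnr_global = masked_psnr(original, global_edit, edit_mask)  # about 20 dB for a 0.1 offset
```

Under this assumed definition, an edit that stays inside the mask scores higher than one that shifts the whole frame, which matches the interpretation of the metric given above.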

Ablation Study

While the inflated image diffusion model can generate semantically consistent images, the T2S model’s generation ability is compromised during inflation, making it inadequate for video inversion even with an optimized unconditional embedding. As seen in Fig. 8 (the 3rd column), directly using the inflated T2S model produces unrealistic results with an inaccurate background. To mitigate this, we initialize the T2S model by fine-tuning on the given video. The effect is evident in Fig. 8 (4th column), where the cat’s appearance improves and the grass reconstruction becomes more accurate.

Shared unconditional embedding.

Table 2 presents the quantitative results for video inversion. We observe that optimizing a shared unconditional embedding can significantly improve the PSNR compared to TAV+DDIM. In contrast, using a separate unconditional embedding for each frame increases the PSNR by only 0.2 while requiring $n$ times as many parameters. Moreover, we find that using multiple unconditional embeddings leads to a lower Masked PSNR of 20.51 after attention control compared to the shared unconditional embedding. Thus, we conclude that a shared unconditional embedding is the most effective and efficient method for video inversion.

Decoupled-guidance attention control.

To obtain the latent features of the input video, we optimize an unconditional embedding using the source prompt. It is important to note that this embedding is only suitable for the source prompt during the prompt-to-prompt process. Using the optimized embedding for the target prompt may negatively impact the quality of the generated results, as shown in Fig. 9 (1st row). Instead, we utilize the initialized unconditional embedding for the target prompt and incorporate attention maps from two branches. The decoupled-guidance attention control approach significantly improves the editing quality, as shown in Fig. 9 (the 2nd row). Quantitative ablations can be found in Tab. 1 (the 3rd row and 4th row).

Conclusion

Our proposed approach, Video-P2P, provides a simple yet effective solution for video editing with cross-attention control. By leveraging a pre-trained image diffusion model, we demonstrate that editing a video locally and globally is possible. Specifically, we optimize a shared unconditional embedding based on a well-initialized T2S model for video inversion. We also propose using different unconditional embeddings for source and target prompts, and integrating attention maps from two branches for improved attention control. These techniques enable Video-P2P to perform various applications, such as word swap, prompt refinement, and attention re-weighting. In future work, we will enhance its capability to handle more complex editing tasks like injecting extra objects.