Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy

Introduction

Recent text-to-image diffusion models such as DALLE-2, Imagen, and Stable Diffusion demonstrate exceptional ability in generating diverse and high-quality images guided by natural language. Building on these models, a multitude of image editing methods have emerged, including model fine-tuning for customized object generation, image-to-image translation, image inpainting, and object editing. These applications allow users to synthesize and edit images effortlessly, using natural language within a unified diffusion framework, greatly improving creation efficiency. As video content surges in popularity on social media platforms, the demand for more streamlined video creation tools has concurrently risen. Yet, a critical challenge remains: the direct application of existing image diffusion models to videos leads to severe flickering issues.

Researchers have recently turned to text-guided video diffusion models and proposed three solutions. The first solution involves training a video model on large-scale video data, which requires significant computing resources. Additionally, the re-designed video model is incompatible with existing off-the-shelf image models. The second solution is to fine-tune image models on a single video, which is less efficient for long videos. Overfitting to a single video may also degrade the performance of the original models. The third solution involves zero-shot methods that require no training. During the diffusion sampling process, cross-frame constraints are imposed on the latent features for temporal consistency. The zero-shot strategy requires fewer computing resources and is mostly compatible with existing image models, showing promising potential. However, current cross-frame constraints are limited to global styles and are unable to preserve low-level consistency, e.g., the overall style may be consistent, but the local structures and textures may still flicker.

Achieving successful application of image diffusion models to the video domain is a challenging task. It requires 1) Temporal consistency: cross-frame constraints for low-level consistency; 2) Zero-shot: no training or fine-tuning required; 3) Flexibility: compatible with off-the-shelf image models for customized generation. As mentioned above, image models can be customized by fine-tuning on specific objects to capture the target style more precisely than general models. Figure 2 shows two examples. To take advantage of this, we employ a zero-shot strategy for model compatibility and aim to solve the key issue of this strategy: maintaining low-level temporal consistency.

To achieve this goal, we propose novel hierarchical cross-frame constraints for pre-trained image models to produce coherent video frames. Our key idea is to use optical flow to apply dense cross-frame constraints, with the previous rendered frame serving as a low-level reference for the current frame and the first rendered frame acting as an anchor to regulate the rendering process to prevent deviations from the initial appearance. Hierarchical cross-frame constraints are realized at different stages of diffusion sampling. In addition to global style consistency, our method enforces consistency in shapes, textures and colors at early, middle and late stages, respectively. This innovative and lightweight modification achieves both global and local temporal consistency. Figure 1 presents our coherent video translation results over off-the-shelf image models customized for six unique styles.

Based on the insight, this paper introduces a novel zero-shot framework for text-guided video-to-video translation, consisting of two parts: key frame translation and full video translation. In the first part, we adapt pre-trained image diffusion models with hierarchical cross-frame constraints for generating key frames. In the second part, we propagate the rendered key frames to other frames using temporal-aware patch matching and frame blending. The diffusion-based generation is excellent at content creation, but its multi-step sampling process is inefficient. The patch-based propagation, on the other hand, can efficiently infer pixel-level coherent frames but is not capable of creating new content. By combining these two parts, our framework strikes a balance between quality and efficiency. To summarize, our main contributions are as follows:

A novel zero-shot framework for text-guided video-to-video translation, which achieves both global and local temporal consistency, requires no training, and is compatible with pre-trained image diffusion models.

Hierarchical cross-frame consistency constraints to enforce temporal consistency in shapes, textures and colors, which adapt image diffusion models to videos.

Hybrid diffusion-based generation and patch-based propagation to strike a balance between quality and efficiency.

Related Work

Generating images with descriptive sentences is intuitive and flexible. Early attempts explore GANs to synthesize realistic images. With the powerful expressivity of the Transformer, autoregressive models are proposed to model image pixels as a sequence with autoregressive dependency between each pixel. DALL-E and CogView train an autoregressive transformer on image and text tokens. Make-A-Scene further considers segmentation masks as a condition.

Recent studies focus on diffusion models for text-to-image generation, where images are synthesized via a gradual denoising process. DALLE-2 and Imagen introduce pretrained large language models as text encoders to better align the image with the text, and cascade diffusion models for high-resolution image generation. GLIDE introduces classifier-free guidance to improve text conditioning. Instead of applying denoising in the image space, Latent Diffusion Models use the low-resolution latent space of VQ-GAN to improve efficiency. We refer readers to existing surveys for a thorough overview.

In addition to diffusion models for general images, customized models are studied. Textual Inversion and DreamBooth learn special tokens to capture novel concepts and generate related images given a small number of example images. LoRA accelerates the fine-tuning of large models by learning low-rank weight matrices added to existing weights. ControlNet fine-tunes a new control path to provide pixel-level conditions such as edge maps and pose, enabling fine-grained image generation. Our method does not alter the pre-trained model and is thus orthogonal to these existing techniques. This allows our method to leverage DreamBooth and LoRA for better customized video translation and to use ControlNet for temporally consistent structure guidance, as in Fig. 2.

2.2 Video Editing with Diffusion Models

For text-to-video generation, Video Diffusion Model proposes to extend the 2D U-Net in the image model to a factorized space-time U-Net. Imagen Video scales up the Video Diffusion Model with a cascade of spatial and temporal video super-resolution models, which is further extended to video editing by Dreamix. Make-A-Video leverages video data in an unsupervised manner to learn the movement to drive the image model. Although promising, the above methods need large-scale video data for training.

Tune-A-Video instead inflates an image diffusion model into a video model with cross-frame attention, and fine-tunes it on a single video to generate videos with related motion. Based on it, Edit-A-Video, Video-P2P and vid2vid-zero utilize Null-Text Inversion for precise inversion to preserve the unedited region. However, these models need fine-tuning of the pre-trained model or optimization over the input video, which is less efficient.

Recent developments have seen the introduction of zero-shot methods that, by design, operate without any training phase. Thus, these methods are naturally compatible with pre-trained diffusion variants like InstructPix2Pix or ControlNet to accept more flexible conditions like depth and edges. Based on the editing masks detected by Prompt2Prompt to indicate the channel and spatial region to preserve, FateZero blends the attention features before and after editing. Text2Video-Zero translates the latent to directly simulate motions, and Pix2Video matches the latent of the current frame to that of the previous frame. All the above methods largely rely on cross-frame attention and early-step latent fusion to improve temporal consistency. However, as we will show later, these strategies predominantly cater to high-level styles and shapes, and are less effective in maintaining cross-frame consistency at the level of texture and detail. In contrast to these approaches, our method proposes a novel pixel-aware cross-frame latent fusion, which non-trivially achieves pixel-level temporal consistency.

Another zero-shot solution is to apply frame interpolation to infer the videos based on one or more diffusion-edited frames. The seminal work of image analogy migrates the style effect from an exemplar pair to other images with patch matching. Fišer et al. extend image analogy to facial video translation with the guidance of facial features. Later, Jamriška et al. propose an improved EbSynth for general video translation based on multiple exemplar frames with a novel temporal blending approach. Although these patch-based methods can preserve fine details, their temporal consistency largely relies on the coherence across the exemplar frames. Thus, our adapted diffusion model for generating coherent frames is well suited for these methods, as we will show later in Fig. 11. In this paper, we integrate the zero-shot EbSynth into our framework to achieve better temporal consistency and accelerate inference without any further training.

Preliminary: Diffusion Models

Stable Diffusion is a latent diffusion model operating in the latent space of an autoencoder $\mathcal{D}(\mathcal{E}(\cdot))$ , where $\mathcal{E}$ and $\mathcal{D}$ are the encoder and decoder, respectively. Specifically, for an image $I$ with its latent feature $x_{0}=\mathcal{E}(I)$ , the diffusion forward process iteratively adds noise to the latent:

$$q(x_{t}|x_{t-1})=\mathcal{N}\big(x_{t};\sqrt{\alpha_{t}}\,x_{t-1},(1-\alpha_{t})\mathbf{I}\big), \tag{1}$$

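where $t=1,...,T$ is the time step, $q(x_{t}|x_{t-1})$ is the conditional density of $x_{t}$ given $x_{t-1}$ , and $\alpha_{t}$ are hyperparameters. Alternatively, we can directly sample $x_{t}$ at any time step from $x_{0}$ with

$$q(x_{t}|x_{0})=\mathcal{N}\big(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},(1-\bar{\alpha}_{t})\mathbf{I}\big), \tag{2}$$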

where $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ .
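The closed-form sampling of $x_{t}$ from $x_{0}$ can be illustrated with a minimal NumPy sketch. The linear schedule for $\beta_{t}=1-\alpha_{t}$, its endpoints, and the names `make_alpha_bar` and `q_sample` are assumptions for illustration only; the schedule actually used by Stable Diffusion may differ.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear schedule for beta_t, with alpha_t = 1 - beta_t.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)  # alpha_bar[t-1] = prod_{i<=t} alpha_i

def q_sample(x0, t, alpha_bar, rng):
    # Closed-form draw of x_t given x_0; t is 1-indexed as in the text.
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8, 8))          # toy latent: channels x height x width
x_early = q_sample(x0, 10, alpha_bar, rng)    # still close to x0
x_late = q_sample(x0, 1000, alpha_bar, rng)   # close to pure Gaussian noise
```

Under this assumed schedule, $\bar{\alpha}_{t}$ decreases monotonically toward zero, so a small $t$ keeps most of $x_{0}$ while $t=T$ leaves almost none of it.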

Then in the diffusion backward process, a U-Net $\epsilon_{\theta}$ is trained to predict the noise of the latent to iteratively recover $x_{0}$ from $x_{T}$ . Given a large $T$ , $x_{0}$ will be completely destroyed in the forward process so that $x_{T}$ approximates a standard Gaussian distribution. Therefore, $\epsilon_{\theta}$ correspondingly learns to infer valid $x_{0}$ from random Gaussian noise. Once trained, we can sample $x_{t-1}$ based on $x_{t}$ with deterministic DDIM sampling:

$$x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_{t\rightarrow 0}+\sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_{\theta}(x_{t},t,c_{p}), \tag{3}$$

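where $\hat{x}_{t\rightarrow 0}$ is the predicted $x_{0}$ at time step $t$ ,

$$\hat{x}_{t\rightarrow 0}=\big(x_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta}(x_{t},t,c_{p})\big)/\sqrt{\bar{\alpha}_{t}}, \tag{4}$$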

and $\epsilon_{\theta}(x_{t},t,c_{p})$ is the predicted noise of $x_{t}$ based on the time step $t$ and the text prompt condition $c_{p}$ .

During inference, we can sample a valid $x_{0}$ from the standard Gaussian noise $x_{T}=z_{T},z_{T}\sim\mathcal{N}(0,\mathbf{I})$ with DDIM sampling, and decode $x_{0}$ to the final generated image $I^{\prime}=\mathcal{D}(x_{0})$ .
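One deterministic DDIM update in its standard form can be sketched as follows. The noise prediction `eps` stands in for the output of $\epsilon_{\theta}(x_{t},t,c_{p})$; the function names and the toy values of $\bar{\alpha}$ are hypothetical and serve only to make the example self-contained.

```python
import numpy as np

def predict_x0(x_t, eps, abar_t):
    # Predicted clean latent from the current latent and the predicted noise.
    return (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)

def ddim_step(x_t, eps, abar_t, abar_prev):
    # Deterministic DDIM update: re-noise the predicted x0 to the previous step.
    x0_hat = predict_x0(x_t, eps, abar_t)
    x_prev = np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps
    return x_prev, x0_hat

# Toy check with an oracle noise prediction (assumption: eps is the true noise).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(x0.shape)
abar_t, abar_prev = 0.5, 0.8
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_prev, x0_hat = ddim_step(x_t, eps, abar_t, abar_prev)
```

With a perfect noise prediction the predicted $x_{0}$ is exact; in practice it is only an estimate, which is why the cross-frame constraints in this paper act on that intermediate prediction.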

Although flexible, natural language has limited spatial control over the output. To improve spatial controllability, ControlNet introduces a side path to Stable Diffusion that accepts extra conditions such as edges, depth and human pose. Let $c_{f}$ be the extra condition; the noise prediction of the U-Net with ControlNet then becomes $\epsilon_{\theta}(x_{t},t,c_{p},c_{f})$ . Compared to InstructPix2Pix, ControlNet is orthogonal to customized Stable Diffusion models. To build a general zero-shot V2V framework, we use ControlNet to provide structure guidance from the input video to improve temporal consistency.

Zero-Shot Text-Guided Video Translation

Given a video with $N$ frames $\{I_{i}\}_{i=0}^{N}$ , our goal is to render it into a new video $\{I^{\prime}_{i}\}_{i=0}^{N}$ in another artistic expression specified by text prompts and/or off-the-shelf customized Stable Diffusion models. Our framework consists of two parts: Key Frame Translation (Sec. 4.1) and Full Video Translation (Sec. 4.2). In the first part, we introduce four hierarchical cross-frame constraints into pre-trained image diffusion models, guiding the rendering of coherent key frames using anchor and previous key frames, as illustrated in Fig. 3. Then in the second part, non-key frames are interpolated based on their two neighboring key frames. Thus our framework can fully exploit the relationship between different frames to enhance temporal consistency of the outputs.

Figure 4 illustrates the $T$ -step sampling pipeline for the key frame translation. Following SDEdit, the pipeline begins with $x_{T}=\sqrt{\bar{\alpha}_{T}}x_{0}+\sqrt{1-\bar{\alpha}_{T}}z_{T},z_{T}\sim\mathcal{N}(0,\mathbf{I})$ , the noisy latent code of the input video frame, rather than pure Gaussian noise. This enables users to determine how much detail of the input frame is preserved in the output by adjusting $T$ , i.e., a smaller $T$ retains more detail. Then, when sampling each frame, we use the first frame as the anchor frame, together with the previous frame, to constrain global style consistency and local temporal consistency.

Specifically, cross-frame attention is applied to all sampling steps for global style consistency (Sec. 4.1.1). In addition, in early steps, we fuse the latent feature with the aligned latent feature of the previous frame to achieve rough shape alignments (Sec. 4.1.2). Then in mid steps, we fuse the latent feature with the encoded warped anchor and previous outputs to realize fine texture alignments (Sec. 4.1.3). Finally, in late steps, we adjust the latent feature distribution for color consistency (Sec. 4.1.4). For simplicity, we will use $\{I_{i}\}_{i=0}^{N}$ to refer to the key frames in this section. We summarize important notations in Table 1.

Similar to other zero-shot video editing methods, we replace self-attention layers in the U-Net with cross-frame attention layers to regularize the global style of $I^{\prime}_{i}$ to match that of $I^{\prime}_{1}$ and $I^{\prime}_{i-1}$ . In Stable Diffusion, each self-attention layer receives the latent feature $v_{i}$ (for simplicity we omit the time step $t$ ) of $I_{i}$ , and linearly projects $v_{i}$ into query, key and value $Q$ , $K$ , $V$ to produce the output by $\textit{Self\_Attn}(Q,K,V)=\textit{Softmax}(\frac{QK^{T}}{\sqrt{d}})\cdot V$ with

$$Q=W^{Q}v_{i},\quad K=W^{K}v_{i},\quad V=W^{V}v_{i}, \tag{5}$$

where $W^{Q}$ , $W^{K}$ , $W^{V}$ are pre-trained matrices for feature projection. Cross-frame attention, by comparison, uses the key $K^{\prime}$ and value $V^{\prime}$ from other frames (we use the first and previous frames), i.e., $\textit{CrossFrame\_Attn}(Q,K^{\prime},V^{\prime})=\textit{Softmax}(\frac{QK^{\prime T}}{\sqrt{d}})\cdot V^{\prime}$ with

$$K^{\prime}=W^{K}[v_{1};v_{i-1}],\quad V^{\prime}=W^{V}[v_{1};v_{i-1}]. \tag{6}$$

Intuitively, self-attention can be thought of as patch matching and voting within a single frame, while cross-frame attention seeks similar patches and fuses the corresponding patches from other frames, meaning the style of $I^{\prime}_{i}$ will inherit that of $I^{\prime}_{1}$ and $I^{\prime}_{i-1}$ .
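The difference between the two attention variants can be sketched in NumPy as below. This is a single-head toy version, not the U-Net implementation: features are stored as rows (tokens x channels), so projections are written as right-multiplications, and all names and sizes are assumptions.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def self_attn(v_i, w_q, w_k, w_v):
    # Query, key and value all come from the current frame.
    return attention(v_i @ w_q, v_i @ w_k, v_i @ w_v)

def cross_frame_attn(v_i, refs, w_q, w_k, w_v):
    # Query from the current frame; key and value from the reference frames.
    ref = np.concatenate(refs, axis=0)
    return attention(v_i @ w_q, ref @ w_k, ref @ w_v)

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
v_first, v_prev, v_cur = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
out = cross_frame_attn(v_cur, [v_first, v_prev], w_q, w_k, w_v)
```

Because the projection matrices are reused unchanged, the swap needs no training: only the source of the keys and values differs, and passing the current frame as its own reference recovers ordinary self-attention.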

4.1.2 Shape-aware cross-frame latent fusion

Cross-frame attention is limited to global style. To constrain the cross-frame local shape and texture consistency, we use optical flow to warp and fuse the latent features. Let $w^{i}_{j}$ and $M^{i}_{j}$ denote the optical flow and occlusion mask from $I_{j}$ to $I_{i}$ , respectively. Let $x_{t}^{i}$ be the latent feature for $I^{\prime}_{i}$ at time step $t$ . We update the predicted $\hat{x}_{t\rightarrow 0}$ in Eq. (3) by

$$\hat{x}^{i}_{t\rightarrow 0}\leftarrow M^{i}_{j}\cdot\hat{x}^{i}_{t\rightarrow 0}+(1-M^{i}_{j})\cdot w^{i}_{j}(\hat{x}^{j}_{t\rightarrow 0}). \tag{7}$$

$w$ and $M$ are downsampled to match the resolution of $x$ (we omit the downsampling operation for simplicity in this paper). For the reference frame $I_{j}$ , we experimentally find that the anchor frame ( $j=0$ ) provides better guidance than the previous frame ( $j=i-1$ ). We observe that interpolating elements in the latent space can lead to blurring and shape distortion in the late steps. Therefore, we limit the fusion to only early steps for rough shape guidance.
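A minimal sketch of warping a reference latent and fusing it with the current prediction is given below. It rests on several assumptions that are not taken from the paper: the occlusion mask equals 1 where the reference is unreliable (so the current prediction is kept there), the flow is given in a backward-sampling convention at latent resolution, and nearest-neighbor sampling replaces whatever interpolation a real implementation would use.

```python
import numpy as np

def warp_nearest(src, flow):
    # Backward warp of src (C, H, W): out[:, y, x] = src[:, y + flow_y, x + flow_x],
    # with flow[0] the x displacement and flow[1] the y displacement (assumed layout).
    _, h, w = src.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return src[:, sy, sx]

def shape_aware_fusion(x0_hat_cur, x0_hat_ref, flow, occlusion):
    # Keep the current prediction where occlusion == 1, use the warped reference elsewhere.
    warped = warp_nearest(x0_hat_ref, flow)
    return occlusion * x0_hat_cur + (1.0 - occlusion) * warped

cur = np.zeros((3, 3, 4))
ref = np.arange(36, dtype=float).reshape(3, 3, 4)
zero_flow = np.zeros((2, 3, 4))
occlusion = np.zeros((3, 4))
occlusion[:, :2] = 1.0                      # pretend the left half is occluded
fused = shape_aware_fusion(cur, ref, zero_flow, occlusion)

shift = np.zeros((2, 3, 4))
shift[0] = 1.0                              # sample one pixel to the right
shifted = warp_nearest(ref, shift)
```

In this toy case the left half of `fused` keeps the current latent and the right half copies the reference, which mirrors how the fusion passes rough shape information between frames in the early steps.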

4.1.3 Pixel-aware cross-frame latent fusion

To constrain the low-level texture features in mid steps, instead of warping the latent feature, we can alternatively warp previous frames and encode them back to the latent space for fusion in an inpainting manner. However, the lossy autoencoder introduces distortions and color bias that easily accumulate along the frame sequence. Figure 5(b) shows an example of the distorted result after encoding and decoding 10 times. Prior work solved this problem by fine-tuning the decoder’s weights to fit each image, which is impractical for long videos. To efficiently solve this problem, we propose a novel fidelity-oriented zero-shot image encoding method.

Our key insight is the observation that the amount of information lost each time in the iterative auto-encoding process is consistent. Therefore, we can predict the information loss for compensation. Specifically, for an arbitrary image $I$ , we encode and decode it twice, obtaining $x^{r}_{0}=\mathcal{E}(I),I_{r}=\mathcal{D}(x^{r}_{0})$ and $x^{rr}_{0}=\mathcal{E}(I_{r}),I_{rr}=\mathcal{D}(x^{rr}_{0})$ . We assume the loss from the target lossless $x_{0}$ to $x^{r}_{0}$ is linear to that from $x^{r}_{0}$ to $x^{rr}_{0}$ . Then we define the encoding $\mathcal{E}^{\prime}$ with compensation as

$$\mathcal{E}^{\prime}(I)=x^{r}_{0}+\lambda_{\mathcal{E}}(x^{r}_{0}-x^{rr}_{0}), \tag{8}$$

where we find the linear coefficient $\lambda_{\mathcal{E}}=1$ works well. We further add a mask $M_{\mathcal{E}}$ to prevent the possible artifacts introduced by compensation (e.g., blue artifact near the eyes in Fig. 5(c)). $M_{\mathcal{E}}$ indicates where the error between $I$ and $\mathcal{D}(\mathcal{E}^{\prime}(I))$ is under a pre-defined threshold. Then, our novel fidelity-oriented image encoding $\mathcal{E}^{*}$ takes the form of

$$\mathcal{E}^{*}(I)=x^{r}_{0}+M_{\mathcal{E}}\cdot\lambda_{\mathcal{E}}(x^{r}_{0}-x^{rr}_{0}). \tag{9}$$

The encoding pipeline is summarized in Fig. 6. As shown in Fig. 5(d), our method preserves image information well even after encoding and decoding 10 times.
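The compensation idea can be tried out on a toy autoencoder. In the sketch below, `encode` and `decode` are hypothetical stand-ins (each encoding shrinks values by 10%), the threshold value is an assumption, and the latent is taken to have the same shape as the image so that the mask needs no resizing; a real VAE would require mapping the image-space mask to latent resolution.

```python
import numpy as np

def fidelity_encode(image, encode, decode, lam=1.0, threshold=0.05):
    x_r = encode(image)                  # first encoding
    x_rr = encode(decode(x_r))           # second encoding after one round trip
    x_comp = x_r + lam * (x_r - x_rr)    # compensated latent
    err = np.abs(image - decode(x_comp))
    mask = (err < threshold).astype(x_r.dtype)  # 1 where compensation is trusted
    return x_r + mask * lam * (x_r - x_rr)

# Toy lossy autoencoder (assumption): each encoding shrinks values by 10%.
encode = lambda img: 0.9 * img
decode = lambda lat: lat

image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
plain_err = np.abs(decode(encode(image)) - image).max()
ours_err = np.abs(decode(fidelity_encode(image, encode, decode)) - image).max()
```

For this toy loss the plain round trip is off by up to 0.1 while the compensated encoding is off by up to 0.01, because the second round trip reveals how much the first one lost.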

As illustrated in Fig. 7, for pixel-level coherence, we warp the anchor frame $I^{\prime}_{0}$ and the previous frame $I^{\prime}_{i-1}$ to the $i$ -th frame and overlay them on a rough rendered frame $\bar{I}^{\prime}_{i}$ obtained without the pixel-aware cross-frame latent fusion as

4.1.4 Color-aware adaptive latent adjustment

Finally, we apply AdaIN to $\hat{x}^{i}_{t\rightarrow 0}$ to match its channel-wise mean and variance to those of $\hat{x}^{1}_{t\rightarrow 0}$ in the late steps. This further keeps the color style coherent across all key frames.
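Matching channel-wise statistics in the AdaIN style can be sketched as follows. The function name, the small stabilizing constant `eps`, and the toy latents are assumptions for illustration.

```python
import numpy as np

def adain(x, ref, eps=1e-5):
    # x, ref: (C, H, W). Normalize each channel of x, then rescale to ref's statistics.
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    std_x = x.std(axis=(1, 2), keepdims=True)
    mu_r = ref.mean(axis=(1, 2), keepdims=True)
    std_r = ref.std(axis=(1, 2), keepdims=True)
    return (x - mu_x) / (std_x + eps) * std_r + mu_r

rng = np.random.default_rng(0)
anchor = 2.0 * rng.standard_normal((4, 8, 8)) + 1.0    # stands in for the anchor latent
current = 0.5 * rng.standard_normal((4, 8, 8)) - 3.0   # stands in for the current latent
adjusted = adain(current, anchor)
```

After the adjustment each channel of `adjusted` has the anchor's mean and (up to `eps`) its standard deviation, while the spatial pattern of `current` is preserved.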

4.2 Full Video Translation

For frames with similar content, existing frame interpolation methods like EbSynth can generate plausible results by propagating the rendered frames to their neighbors efficiently. However, compared to diffusion models, frame interpolation cannot create new content. To balance between quality and efficiency, we propose a hybrid framework to render key frames and other frames with the adapted diffusion model and EbSynth, respectively.

Specifically, we sample the key frames uniformly every $K$ frames, i.e., $I_{0},I_{K},I_{2K},...$ and render them to $I^{\prime}_{0},I^{\prime}_{K},I^{\prime}_{2K},...$ by our adapted diffusion model. We then render the remaining non-key frames. Taking $I_{i}$ ( $0<i<K$ ) for example, we adopt EbSynth to interpolate $I^{\prime}_{i}$ with its neighboring stylized key frames $I^{\prime}_{0}$ and $I^{\prime}_{K}$ . EbSynth has two steps: frame propagation and frame blending. In the following, we briefly introduce the main idea of these two steps and discuss how we adapt EbSynth to our framework. For implementation details, please refer to the original EbSynth paper.
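The uniform key frame layout can be written down in a few lines. The helper names are hypothetical, and the handling of trailing frames after the last key frame is left open here because the text does not specify it.

```python
def key_frame_indices(num_frames, k):
    # Indices 0, K, 2K, ... that are rendered by the diffusion model.
    return list(range(0, num_frames, k))

def neighboring_keys(i, k):
    # Key frames on either side of a non-key frame i (assumes both exist in the video).
    left = (i // k) * k
    return left, left + k

keys = key_frame_indices(25, 10)
pair = neighboring_keys(7, 10)
```

For a 25-frame clip with $K=10$ this gives key frames 0, 10 and 20, and frame 7 is interpolated from key frames 0 and 10.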

Frame propagation aims to warp the stylized key frame to its neighboring non-key frames based on their dense correspondences. We directly follow EbSynth to adopt a guided patch-matching algorithm with color, positional, edge, and temporal guidance for dense correspondence prediction and frame warping. Our framework propagates each key frame to its preceding $K-1$ and succeeding $K-1$ frames. We denote the result of propagating $I^{\prime}_{j}$ to $I_{i}$ as $I^{\prime j}_{i}$ . For $I_{i}$ ( $0<i<K$ ), we will obtain two results $I^{\prime 0}_{i}$ and $I^{\prime K}_{i}$ from its nearby key frames $I^{\prime}_{0}$ and $I^{\prime}_{K}$ .

4.2.2 Temporal-aware blending

Frame blending aims to blend $I^{\prime 0}_{i}$ and $I^{\prime K}_{i}$ into a final result $I^{\prime}_{i}$ . EbSynth proposes a three-step blending scheme: 1) Combining colors and gradients of $I^{\prime 0}_{i}$ and $I^{\prime K}_{i}$ by selecting the ones with lower errors during patch matching (Sec. 4.2.1) for each location; 2) Using the combined color image as a histogram reference for contrast-preserving blending over $I^{\prime 0}_{i}$ and $I^{\prime K}_{i}$ to generate an initial blended image; 3) Employing the combined gradient as a gradient reference for screened Poisson blending over the initial blended image to obtain the final result. In contrast, our framework only adopts the first two blending steps and uses the initial blended image as $I^{\prime}_{i}$ . We do not apply Poisson blending, which we find sometimes causes artifacts in non-flat regions and is relatively time-consuming.
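The first blending step, per-location selection by matching error, can be sketched as below. This covers only the color selection; the histogram-based contrast-preserving blending is not shown. The array layout and the tie-breaking rule are assumptions.

```python
import numpy as np

def select_by_error(img_a, img_b, err_a, err_b):
    # img_*: (H, W, 3) propagated candidates; err_*: (H, W) patch-matching errors.
    # At each location keep the candidate with the lower error (ties go to img_a).
    use_a = (err_a <= err_b)[..., None]
    return np.where(use_a, img_a, img_b)

from_left = np.full((2, 2, 3), 0.2)     # stands in for the result propagated from key frame 0
from_right = np.full((2, 2, 3), 0.8)    # stands in for the result propagated from key frame K
err_left = np.array([[0.1, 0.9], [0.1, 0.9]])
err_right = np.array([[0.5, 0.5], [0.5, 0.5]])
combined = select_by_error(from_left, from_right, err_left, err_right)
```

Here the left column is taken from the first candidate and the right column from the second, giving the combined color image that serves as the histogram reference in the second step.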

Experimental Results

The experiment is conducted on one NVIDIA Tesla V100 GPU. We employ the fine-tuned and LoRA models based on Stable Diffusion 1.5 from https://civitai.com/. Stable Diffusion originally uses $T_{max}=1000$ steps. For the sampling pipeline in Fig. 4(b), by default, we set $T_{s}=0.1T_{max}$ , $T_{p0}=0.5T_{max}$ , $T_{p1}=0.8T_{max}$ and $T_{a}=0.8T_{max}$ and use 20 steps of DDIM sampling. We tune $T$ for each video. ControlNet is used to provide structure guidance in terms of edges, with the control weight tuned for each video. We use GMFlow for optical flow estimation and compute the occlusion masks by a forward-backward consistency check. For full video translation, by default, we sample key frames every $K=10$ frames. The testing videos are from https://www.pexels.com/ and https://pixabay.com/, with their short side resized to 512.
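A forward-backward consistency check of the kind mentioned above can be sketched as follows. The thresholds `alpha` and `beta`, the flow layout, and the nearest-neighbor lookup are assumptions chosen for brevity, not the settings used in the experiments.

```python
import numpy as np

def occlusion_mask(flow_fw, flow_bw, alpha=0.01, beta=0.5):
    # flow_fw: (2, H, W) flow from frame A to B; flow_bw: flow from B to A.
    # Channel 0 is the x displacement, channel 1 the y displacement (assumed layout).
    # Returns an (H, W) mask with 1 where the pixel of A is judged occluded.
    _, h, w = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx, ty = xs + flow_fw[0], ys + flow_fw[1]
    inside = (tx >= 0) & (tx <= w - 1) & (ty >= 0) & (ty <= h - 1)
    ix = np.clip(np.rint(tx).astype(int), 0, w - 1)
    iy = np.clip(np.rint(ty).astype(int), 0, h - 1)
    bw_at_target = flow_bw[:, iy, ix]            # backward flow at the landing position
    sq_diff = ((flow_fw + bw_at_target) ** 2).sum(axis=0)
    sq_mag = (flow_fw ** 2).sum(axis=0) + (bw_at_target ** 2).sum(axis=0)
    consistent = sq_diff < alpha * sq_mag + beta  # round trip returns close to the start
    return (~(consistent & inside)).astype(np.float32)

h, w = 4, 6
fw = np.zeros((2, h, w)); fw[0] = 1.0    # everything moves one pixel to the right
bw = np.zeros((2, h, w)); bw[0] = -1.0   # and back again
mask = occlusion_mask(fw, bw)
```

In this toy case only the last column is flagged, because those pixels leave the image; everywhere else the forward and backward flows cancel.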

In terms of running time for 512 $\times$ 512 videos, key frame and non-key frame translations take about 14.23s and 1.49s per frame, respectively. Overall, a full video translation takes about $(14.23+1.49(K-1))/K=1.49+12.74/K$ s per frame.
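The per-frame cost formula is easy to evaluate for different key frame intervals. The function name is hypothetical; the two timings are the ones reported above.

```python
def seconds_per_frame(k, key_time=14.23, non_key_time=1.49):
    # One key frame plus (K - 1) interpolated frames, averaged over K frames.
    return (key_time + non_key_time * (k - 1)) / k

default_cost = seconds_per_frame(10)   # default K = 10
all_key_cost = seconds_per_frame(1)    # every frame rendered by diffusion
```

With the default $K=10$ the average is about 2.76 s per frame, compared with 14.23 s when every frame is a key frame.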

We will release our code upon publication of the paper.

5.2 Comparison with State-of-the-Art Methods

We compare with four recent zero-shot methods: vid2vid-zero, FateZero, Pix2Video, and Text2Video-Zero on key frame translation with $K=5$ . The official code of the first three methods does not support ControlNet, and when loading customized models, we find they fail to generate plausible results, e.g., vid2vid-zero will generate frames totally different from the input. Therefore, only Text2Video-Zero and our method use the customized model with ControlNet. Figure 8 and Figure 9 present the visual results. FateZero successfully reconstructs the input frame but fails to adjust it to match the prompt. On the other hand, vid2vid-zero and Pix2Video excessively modify the input frame, leading to significant shape distortion and discontinuity across frames. While each frame generated by Text2Video-Zero exhibits high quality, they lack coherence in local textures as indicated by the black boxes. Finally, our proposed method demonstrates clear superiority in terms of output quality, content and prompt matching and temporal consistency.

For quantitative evaluation, we follow FateZero and Pix2Video to report Fram-Acc (CLIP-based frame-wise editing accuracy), Tmp-Con (CLIP-based cosine similarity between consecutive frames), and Pixel-MSE (averaged mean-squared pixel error between aligned consecutive frames) in Table 2. Our method achieves the best temporal consistency and the second best frame editing accuracy. We further conduct a user study with 30 participants. The participants are asked to select the best results among the five methods based on three criteria: 1) how well the result balances between the prompt and the input frame, 2) the temporal consistency of the result, and 3) the overall quality of the video translation. Table 2 presents the average preference rates across 8 testing videos, and our method achieves the highest rates in all three metrics.
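The two temporal metrics can be sketched as below, assuming the per-frame embeddings and the flow-aligned frames have already been computed elsewhere. The function names and input layouts are assumptions; the evaluation code of FateZero and Pix2Video may differ in detail.

```python
import numpy as np

def tmp_con(embeddings):
    # embeddings: (N, D) per-frame feature vectors (e.g. from a CLIP image encoder).
    # Mean cosine similarity between consecutive frames.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return float((e[:-1] * e[1:]).sum(axis=1).mean())

def pixel_mse(aligned_prev, current):
    # aligned_prev[n]: frame n warped onto frame n + 1; current[n]: frame n + 1.
    return float(np.mean((aligned_prev - current) ** 2))

rng = np.random.default_rng(0)
static_embeddings = np.tile(rng.standard_normal((1, 16)), (5, 1))  # identical frames
frames = rng.random((4, 8, 8, 3))
score_static = tmp_con(static_embeddings)
mse_static = pixel_mse(frames, frames)
```

A perfectly static video gives a Tmp-Con of 1 and a Pixel-MSE of 0; flicker lowers the former and raises the latter.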

5.3 Ablation Study

Figure 10 compares the results with and without different cross-frame consistency constraints. We demonstrate the efficacy of our approach on a video containing simple translational motion in the first half and complex 3D rotation transformations in the latter half. To better evaluate the temporal consistency, we encourage readers to watch the videos on the project webpage. The cross-frame attention ensures consistency in global style, while the adaptive latent adjustment in Sec. 4.1.4 maintains the same hair color as the first frame; without it, the hair color follows the input frame and turns dark. Note that the adaptive latent adjustment is optional to allow users to decide which color to follow. The above two global constraints cannot capture local movement. The shape-aware latent fusion (SA fusion) in Sec. 4.1.2 addresses this by translating the latent features to translate the neck ring, but cannot maintain pixel-level consistency for complex motion. Only the proposed pixel-aware latent fusion (PA fusion) can coherently render local details such as hair styles and acne.

We provide additional examples in Figs. 11-12 to demonstrate the effectiveness of PA fusion. While ControlNet can guide the structure well, the inherent randomness introduced by noise addition and denoising makes it difficult to maintain coherence in local textures, resulting in missing elements and altered details. The proposed PA fusion restores these details by utilizing the corresponding pixel information from previous frames. Moreover, such consistency between key frames can effectively reduce the ghosting artifacts in interpolated non-key frames.

We present a detailed analysis of our fidelity-oriented image encoding in Figs. 13-15, in addition to Fig. 5. Two of Stable Diffusion’s officially released autoencoders, the fine-tuned f8-ft-MSE VAE and the original, more lossy kl-f8 VAE, are used for testing our method. The fine-tuned VAE introduces artifacts and the original VAE results in great color bias as in Fig. 13(b). Our proposed fidelity-oriented image encoding effectively alleviates these issues. For quantitative evaluation, we report the MSE between the input image and the reconstructed result after multiple rounds of encoding and decoding in Fig. 14, using the first 1,000 images of the MS-COCO validation set. The results are consistent with the visual observations: our proposed method significantly reduces error accumulation compared to raw encoding methods. Finally, we validate our encoding method in the video translation process in Fig. 15(b)(c), where we use only the previous frame without the anchor frame in Eq. (10) to better visualize error accumulation. Our method largely reduces the loss of details and color bias caused by lossy encoding. Besides, our pipeline includes an anchor frame and adaptive latent adjustment to further regulate the translation, as shown in Fig. 15(d), where no obvious errors are observed.

We report the quantitative full video translation results of Fig. 10(a) under different $K$ in Table 3. With large $K$ , more frame interpolation improves pixel-level temporal consistency, which however harms the quality, leading to low Fram-Acc. A broad range of $K\in$ is recommended for balance.

5.4 More Results

The proposed pipeline allows flexible control over content preservation through the initialization of $x_{T}$ . Rather than setting $x_{T}$ to a Gaussian noise (Fig. 16(b)), we use a noisy latent version of the input frame to better preserve details (Fig. 16(c)). Users can adjust the value of $T$ to balance content and prompt. Moreover, if the input frame introduces unwanted color bias (e.g., blue sky in Chinese ink painting), a color correction option is provided: the input frame is adjusted to match the color histogram of the frame generated by $x_{T}=z_{T}$ (Fig. 16(b)). With the adjusted frame as input (bottom row of Fig. 16(a)), the rendered results (bottom row of Figs. 16(c)-(f)) better match the color indicated by the prompt.

Figure 17 shows some applications of our method. With prompts ‘a cute cat/fox/hamster/rabbit’, we can perform text-guided editing to translate a dog into other kinds of pets in Fig. 17(a). By using customized models for generating cartoons or photos, we can achieve non-photorealistic and photorealistic rendering in Fig. 17(b) and Figs. 17(c)(d), respectively. In Fig. 18, we present our synthesized dynamic virtual characters of novels and manga, based on a real human video and a prompt to describe the appearance. Additional results are shown in Fig. 19.

5.5 Limitations

Figures 20-22 illustrate typical failure cases of our method. First, our method relies on optical flow, and therefore inaccurate optical flow can lead to artifacts. In Fig. 20, our method can only preserve the embroidery if the cross-frame correspondence is available. Otherwise, the proposed PA fusion will have no effect. Second, our method assumes the optical flow remains unchanged before and after translation, which may not hold true for significant appearance changes as in Fig. 21(b), where the resulting movement may be wrong. Although setting a smaller $T$ can address this issue, it may compromise the desired styles. Meanwhile, mismatches in the optical flow translate into mismatches in the translated key frames, which may lead to ghosting artifacts (Fig. 21(d)) after temporal-aware blending. Also, we find that small details and subtle motions like accessories and eye movement cannot be well preserved during the translation. Lastly, we uniformly sample the key frames, which may not be optimal. Ideally, the key frames should contain all unique objects; otherwise, the propagation cannot create unseen content such as the hand in Fig. 22(b). One potential solution is user-interactive translation, where users can manually assign new key frames based on the previous results.

Conclusion

This paper presents a zero-shot framework to adapt image diffusion models for video translation. Our method utilizes hierarchical cross-frame constraints to enforce temporal consistency in both global style and low-level textures, with optical flow playing the key role. The compatibility with existing image diffusion techniques indicates that our idea might be applied to other text-guided video editing tasks, such as video super-resolution and inpainting. Additionally, our proposed fidelity-oriented image encoding could benefit existing diffusion-based methods. We believe that our approach can facilitate the creation of high-quality and temporally coherent videos and inspire further research in this field.

Acknowledgments. This study is supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). It is also supported by Singapore MOE AcRF Tier 2 (MOE-T2EP20221-0011, MOE-T2EP20221-0012) and NTU NAP.