CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen

Introduction

The field of image processing has witnessed remarkable advancements, largely attributable to the power of generative models trained on extensive datasets, yielding exceptional quality and precision. However, the processing of video content has not achieved comparable progress. One challenge lies in maintaining high temporal consistency, a task complicated by the inherent randomness of neural networks. Another challenge arises from the nature of video datasets themselves, which often include textures of inferior quality compared to their image counterparts and necessitate greater computational resources. Consequently, the quality of video-based algorithms lags significantly behind that of image-based ones. This contrast prompts a question: is it feasible to represent a video in the form of an image, so that established image algorithms can be applied seamlessly to video content with high temporal consistency?

In pursuit of this objective, researchers have suggested the generation of video mosaics from dynamic videos in the era preceding deep learning, and the utilization of a neural layered image atlas subsequent to the proposal of implicit neural representations. Nonetheless, these methods exhibit two principal deficiencies. First, the capacity of these representations, particularly in faithfully reconstructing intricate details within a video, is restricted. Often, the reconstructed video overlooks subtle motion details, such as blinking eyes or slight smiles. The second limitation pertains to the typically distorted nature of the estimated atlas, which consequently suffers from impaired semantic information. Existing image processing algorithms, therefore, do not perform optimally as the estimated atlas lacks sufficient naturalness.

We propose a novel approach to video representation that utilizes a 2D hash-based image field coupled with a 3D hash-based temporal deformation field. The incorporation of multi-resolution hash encoding for the representation of temporal deformation significantly enhances the ability to reconstruct general videos. This formulation facilitates tracking the deformation of complex entities such as water and smog. However, the heightened capability of the deformation field presents a challenge in estimating a natural canonical image: an unnatural canonical image, paired with a suitably fitted deformation field, can also yield a faithful reconstruction. To navigate this challenge, we suggest employing annealed hash encoding during training. Initially, a smooth deformation grid is utilized to identify a coarse solution applicable to all rigid motions, with high-frequency details added gradually. Through this coarse-to-fine training, the representation achieves a balance between the naturalness of the canonical image and the faithfulness of the reconstruction. We observe a noteworthy enhancement in reconstruction quality compared to preceding methods. This improvement is quantified as an increase of approximately 4.4 dB in PSNR, along with an observable increase in the naturalness of the canonical image. Our optimization process requires only about 300 seconds to estimate the canonical image with the deformation field, while the previous implicit layered representations take more than 10 hours.

Building upon our proposed content deformation field, we illustrate lifting image processing tasks, such as prompt-guided image translation, super-resolution, and segmentation, to the more dynamic realm of video content. Our approach to prompt-guided video-to-video translation employs ControlNet on the canonical image, propagating the translated content via the learned deformation. The translation process is conducted on a single canonical image and obviates the need to run time-intensive inference models (e.g., diffusion models) across all frames. Our translation outputs exhibit marked improvements in temporal consistency and texture quality over the state-of-the-art zero-shot video translations with generative models. When contrasted with Text2Live, which relies on a neural layered atlas, our model is proficient in handling more complex motion, producing more natural canonical images, and thereby achieving superior translation results. Additionally, we extend the application of image algorithms such as super-resolution, semantic segmentation, and keypoint detection to the canonical image, leading to their practical applications in video contexts. These include video super-resolution, video object segmentation, and video keypoint tracking, among others. Our proposed representation consistently delivers superior temporal consistency and high-fidelity synthesized frames, demonstrating its potential as a groundbreaking tool in video processing.

Related Work

Implicit Neural Representations. Implicit representations in conjunction with coordinate-based Multilayer Perceptrons (MLPs) have demonstrated their powerful capability in accurately representing images, videos, and 3D/4D representations. These techniques have been employed in a range of applications, including novel view synthesis, image super-resolution, and 3D/4D reconstruction. Furthermore, to speed up training, a variety of acceleration techniques have been explored that replace the original Fourier positional encoding with a discrete representation such as a multi-resolution feature grid or a hash table. Moreover, the adoption of an implicit deformation field has displayed a remarkable capability to overfit dynamic scenes. Inspired by these works, our primary objective is to reconstruct videos by utilizing a canonical image that inherits semantics for video processing purposes.

Consistent Video Editing. Our research is closely aligned with the domain of consistent video editing, which predominantly features two primary approaches: propagation-based methods and layered representation-based techniques. Propagation-based methods center on editing an initial frame and subsequently disseminating those edits throughout the video sequence. While this approach offers advantages in terms of computational efficiency and simplicity, it may be prone to inaccuracies and inconsistencies during the propagation of edits, particularly in situations characterized by complex motion or occlusion. Conversely, layered representation-based techniques entail decomposing a video into distinct layers, thereby facilitating greater control and flexibility during the editing process. Text2Live introduces the application of CLIP models for video editing by modifying an optimized atlas using text inputs, thereby yielding temporally consistent video editing results. Our work bears similarities to Text2Live in the context of employing an optimized representation for videos. However, our methodology diverges in several aspects: we optimize a more semantically aware canonical representation incorporating a hash-based deformable design and attain higher-fidelity video processing.

Video Processing via Generative Models. The advancement of diffusion models has markedly enhanced the synthesis quality of text-to-image generation, surpassing the performance of prior methodologies. State-of-the-art diffusion models, such as GLIDE, DALL-E 2, Stable Diffusion, and Imagen, have been trained on millions of images, resulting in exceptional generative capabilities. While existing text-to-image (T2I) models enable free-text generation, incorporating additional conditioning factors such as edge, depth map, and normal map is essential for achieving precise control. In an effort to enhance controllability, researchers have proposed several approaches. For instance, PITI involves retraining an image encoder to map latents to the T2I latent space. InstructPix2Pix, on the other hand, fine-tunes T2I models using synthesized image condition pairs. ControlNet introduces additional control conditions for Stable Diffusion through an auxiliary branch, thereby generating images that faithfully adhere to input condition maps. A recent research direction concentrates on the processing of videos utilizing T2I models exclusively. Approaches like Tune-A-Video, Text2Video-Zero, FateZero, Vid2Vid-Zero, and Video-P2P explore the latent space of DDIM and incorporate cross-frame attention maps to facilitate consistent generation. Nevertheless, these methods may experience compromised temporal consistency due to the inherent randomness of generation, and the control condition may not be achieved with precision.

Text-to-video generation has emerged as a prominent research area in recent years, with prevalent approaches encompassing the training of diffusion models or autoregressive transformers on extensive datasets. Although text-to-video architectures such as NUWA, CogVideo, Phenaki, Make-A-Video, Imagen Video, and Gen-1 are capable of generating video frames that semantically correspond to the input text, they may exhibit limitations in terms of precise control over video conditions or low resolution due to substantial computational demands.

Method

Problem Formulation. Given a video $\mathcal{V}$ comprising frames $\{I_{1},I_{2},...,I_{N}\}$, one can naively apply the image processing algorithm $\mathcal{X}$ to each frame individually for corresponding video tasks, yet may observe undesirable inconsistencies across frames. An alternative strategy involves enhancing algorithm $\mathcal{X}$ with a temporal module, which requires additional training on video data. However, simply introducing a temporal module can hardly guarantee consistency in theory and may result in performance degradation due to insufficient training data.

Motivated by these challenges, we propose representing a video $\mathcal{V}$ using a flattened canonical image $I_{c}$ and a deformation field $\mathcal{D}$. By applying the image algorithm $\mathcal{X}$ on $I_{c}$, we can effectively propagate the effect to the whole video with the learned deformation field. This novel video representation serves as a crucial bridge between image algorithms and video tasks, allowing the direct lifting of state-of-the-art image methodologies to video applications.

The proposed representation should exhibit the following essential characteristics:

Fitting Capability for Faithful Video Reconstruction. The representation should possess the ability to accurately fit large rigid or non-rigid deformations in videos.

Semantic Correctness of the Canonical Image. A distorted or semantically incorrect canonical image can lead to decreased image processing performance, especially considering that most of these processes are trained on natural image data.

Smoothness of the Deformation Field. The assurance of smoothness in the deformation field is an essential feature that guarantees temporal consistency and correct propagation.

Inspired by dynamic NeRFs, we propose to represent the video with two distinct components: the canonical field and the deformation field. These two components are realized with a 2D and a 3D hash table, respectively. To enhance the capacity of these hash tables, two tiny MLPs are integrated. We present our proposed representation for reconstructing and processing videos, as illustrated in Fig. 2. Given a video $\mathcal{V}$ comprising frames $\{I_{1},I_{2},...,I_{N}\}$, we train an implicit deformable model tailored to fit these frames. The model is composed of two coordinate-based MLPs: the deformation field $\mathcal{D}$ and the canonical field $\mathcal{C}$.

3D Hash Encoding for Deformation Field. Specifically, an arbitrary point in the video can be conceptualized as a position $\mathbf{x}_{\text{3D}}=(x,y,t)$ within an orthogonal 3D space. We represent our video space using the 3D hash encoding technique, as depicted on the left side of Fig. 2. This technique encapsulates the 3D space as a multi-resolution feature grid. The term multi-resolution refers to a composition of grids with varying degrees of resolution, and feature grid denotes a grid populated with learnable features at each vertex. In our framework, the multi-resolution feature grid is organized into $L$ distinct levels. The dimensionality of the learnable features is represented as $F$. Furthermore, the resolution of the $l^{\text{th}}$ layer, denoted as $N_{l}$, follows a geometric progression between the coarsest and finest resolutions, denoted collectively as $[N_{\text{min}},N_{\text{max}}]$, using

$$N_{l}=\left\lfloor N_{\text{min}}\cdot b^{\,l}\right\rfloor,\qquad b=\exp\left(\frac{\ln N_{\text{max}}-\ln N_{\text{min}}}{L-1}\right).$$

Considering a queried point $\mathbf{x}_{\text{3D}}$ at the $l^{\text{th}}$ layer, the input coordinate is scaled by that level's grid resolution, and the queried features of $\mathbf{x}_{\text{3D}}$ are trilinearly interpolated from its 8 neighboring corner points (see Fig. 2). To obtain the corner points of $\mathbf{x}_{\text{3D}}$, the scaled coordinate is first rounded down and up as

$$\lfloor\mathbf{x}_{l}\rfloor=\lfloor\mathbf{x}_{\text{3D}}\cdot N_{l}\rfloor,\qquad \lceil\mathbf{x}_{l}\rceil=\lceil\mathbf{x}_{\text{3D}}\cdot N_{l}\rceil,$$

and we map each corner to an entry in the level's respective feature vector array, which has a fixed size of at most $T$. For the coarse levels, the low-resolution grid has fewer than $T$ parameters, so the mapping is $1:1$ and the features can be looked up directly by index. For the finer levels, in contrast, the point is mapped by the hash function,

$$h(\mathbf{x})=\left(\bigoplus_{i=1}^{3}x_{i}\pi_{i}\right)\bmod T,$$

where $\oplus$ denotes the bit-wise XOR operation and $\{\pi_{i}\}$ are unique large prime numbers following .
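The lookup described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the prime constants are the ones commonly used in multi-resolution hash encoding and are assumed here, the function names are hypothetical, and the dense index layout for coarse levels is one arbitrary choice. The sketch uses the corners `lo` and `lo + 1`, which coincide with the rounded-down and rounded-up coordinates except at exact grid vertices.

```python
import itertools
import numpy as np

# Assumed constants: primes commonly used for this kind of spatial hash.
PRIMES = (1, 2654435761, 805459861)
MASK64 = 0xFFFFFFFFFFFFFFFF


def level_resolutions(n_min, n_max, num_levels):
    """Geometric progression of grid resolutions from n_min to n_max."""
    b = np.exp((np.log(n_max) - np.log(n_min)) / (num_levels - 1))
    # The small epsilon guards against floor() of values like 511.9999999.
    return np.floor(n_min * b ** np.arange(num_levels) + 1e-9).astype(np.int64)


def corner_index(corner, resolution, table_size):
    """Map an integer (x, y, t) grid corner to a row of the feature table."""
    x, y, t = (int(c) for c in corner)
    side = resolution + 1
    if side ** 3 <= table_size:
        # Coarse level: the grid fits in the table, so index it 1:1.
        return (x * side + y) * side + t
    # Fine level: XOR of coordinate-times-prime, then modulo the table size.
    h = 0
    for c, p in zip((x, y, t), PRIMES):
        h ^= (c * p) & MASK64
    return h % table_size


def encode(x3d, tables, resolutions):
    """Concatenate trilinearly interpolated features over all levels.

    x3d: (x, y, t) in [0, 1]^3; tables[l] has shape (T, F).
    """
    feats = []
    for table, res in zip(tables, resolutions):
        scaled = np.asarray(x3d, dtype=np.float64) * res
        lo = np.minimum(np.floor(scaled).astype(np.int64), res - 1)
        frac = scaled - lo
        out = np.zeros(table.shape[1])
        for offset in itertools.product((0, 1), repeat=3):
            offset = np.array(offset)
            # Trilinear weight of this corner; the 8 weights sum to 1.
            weight = np.prod(np.where(offset == 1, frac, 1.0 - frac))
            out += weight * table[corner_index(lo + offset, res, table.shape[0])]
        feats.append(out)
    return np.concatenate(feats)
```

With $L=6$ levels, $F=2$ and $T=2^{14}$, for example, `encode` returns a 12-dimensional feature vector; only the coarsest level ($17^3$ vertices) is indexed densely, and the remaining levels go through the hash.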

The output color value at coordinate $\mathbf{x}$ for frame $t$ can be computed as

This output can be supervised using the ground truth color present in the input frame.
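As a rough illustration of how the two fields compose, the sketch below assumes that the deformation field predicts a 2D offset that is added to the pixel coordinate before the canonical field is queried. That parameterization, the function names, and the toy stand-in fields are all assumptions made for this example; the fields in the actual model are hash encodings followed by small MLPs.

```python
import numpy as np


def render_color(x, t, deformation_field, canonical_field):
    """Color of pixel x in frame t: deform first, then query the canonical field."""
    x = np.asarray(x, dtype=np.float64)
    # Assumption: the deformation field outputs an offset in canonical space.
    x_canonical = x + deformation_field(x, t)
    return canonical_field(x_canonical)


def reconstruction_loss(pred_rgb, gt_rgb):
    """Mean squared (L2) error between predicted and ground-truth color."""
    diff = np.asarray(pred_rgb, dtype=np.float64) - np.asarray(gt_rgb, dtype=np.float64)
    return float(np.mean(diff ** 2))


def toy_canonical(p):
    """Hypothetical canonical image: color depends only on position."""
    return np.array([p[0], p[1], 0.5])


def toy_deformation(x, t):
    """Hypothetical motion: content drifts right by 0.1 per frame."""
    return np.array([-0.1 * t, 0.0])
```

Here `render_color((0.5, 0.2), 2, toy_deformation, toy_canonical)` looks up the canonical position `(0.3, 0.2)`, and the loss against the ground-truth pixel is what supervises both fields jointly.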

Model Design

The proposed representation can effectively model and reconstruct both the canonical content and the temporal deformation for an arbitrary video. However, it faces challenges in meeting the requirements for robust video processing. In particular, while 3D hash deformation possesses powerful fitting capability, it compromises the smoothness of temporal deformation. This trade-off makes it notably difficult to maintain the inherent semantics of the canonical image, creating a significant barrier to the adaptation of established image algorithms for video use. To achieve precise video reconstruction while preserving the inherent semantics of the canonical image, we propose the use of annealed multi-resolution hash encoding. To further enhance the smoothness of deformation, we introduce flow-based consistency. In challenging cases, such as those involving large occlusions or complex multi-object scenarios, we suggest utilizing additional semantic information. This can be achieved by using semantic masks in conjunction with the grouped deformation fields.

Annealed 3D Hash Encoding for Deformation. At the finer resolutions, the hash encoding enhances the ability to fit complex deformations but introduces discontinuity and distortion in the canonical field (see Fig. 9). Inspired by the annealing strategy utilized in dynamic NeRFs, we employ annealed hash encoding as a progressive frequency filter on the deformation. More specifically, we apply progressively controlled weights to the features interpolated at different resolutions. The weight for the $l^{\text{th}}$ layer in training step $k$ is computed as

where $N_{\text{beg}}$ is a predefined step at which annealing begins, $m$ is a hyperparameter controlling the annealing speed, and $N_{\text{step}}$ is the number of annealing steps.
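One way such a schedule can be realized is the cosine window used for coarse-to-fine positional encodings in dynamic NeRFs, $w_{l}(k)=\tfrac{1}{2}\bigl(1-\cos(\pi\cdot\mathrm{clamp}(m\,(k-N_{\text{beg}})/N_{\text{step}}-l,\,0,\,1))\bigr)$. The sketch below assumes this form, and additionally assumes that the coarsest level stays fully enabled so that training starts from a smooth deformation grid. Both choices, as well as the function name, are illustrative assumptions and not a statement of the authors' implementation.

```python
import numpy as np


def anneal_weights(step, num_levels, n_beg, n_step, m):
    """Per-level weights in [0, 1] that switch on from coarse to fine."""
    # alpha grows linearly once annealing begins; m sets how fast.
    alpha = m * (step - n_beg) / n_step
    levels = np.arange(num_levels)
    weights = 0.5 * (1.0 - np.cos(np.pi * np.clip(alpha - levels, 0.0, 1.0)))
    # Assumption: the coarsest level is always active.
    weights[0] = 1.0
    return weights
```

Under these assumptions, with 16 levels, `n_beg=4000`, `n_step=4000`, and `m=16`, only the coarsest level is active at step 4000, level 1 has weight 0.5 at step 4375, and every level is fully active from step 8000 onward. The interpolated features of each level would be multiplied by its weight before being passed to the MLP.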

Flow-guided Consistency Loss. Corresponding points identified by flows with high confidence should map to the same point in the canonical field. We compute the flow-guided consistency loss according to this observation. For two consecutive frames $I_{i}$ and $I_{i+1}$, we employ RAFT to estimate the forward flow $\mathcal{F}_{i\rightarrow i+1}$ and the backward flow $\mathcal{F}_{i+1\rightarrow i}$. The confident region of a frame $I_{i}$ can be defined as

where $\epsilon$ represents a hyperparameter for the error threshold.

where $\mathcal{F}_{t\rightarrow t+1}^{\mathbf{x}}$ and $M_{\text{flow}}^{\mathbf{x}}$ are the optical flow and the flow confidence at $\mathbf{x}$. The flow loss effectively regularizes the smoothness of the deformation field, especially in smooth regions.
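The idea can be sketched in NumPy as follows. The confidence test used here is a common forward-backward round-trip check with threshold `eps`, and the loss is a masked mean distance between the canonical coordinates of matched points; both are assumptions standing in for the exact definitions, and the flows are assumed to be supplied as arrays (for instance, precomputed with RAFT). Function names are hypothetical.

```python
import numpy as np


def confidence_mask(flow_fwd, flow_bwd, eps=0.5):
    """Boolean (H, W) mask of pixels whose forward and backward flows agree.

    flow_fwd, flow_bwd: (H, W, 2) arrays of (dx, dy) displacements in pixels.
    """
    h, w, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel lands in the next frame (nearest-neighbor lookup).
    xt = np.rint(xs + flow_fwd[..., 0]).astype(int)
    yt = np.rint(ys + flow_fwd[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    xt = np.clip(xt, 0, w - 1)
    yt = np.clip(yt, 0, h - 1)
    # A consistent round trip returns to the start: fwd + bwd(target) ~ 0.
    round_trip = flow_fwd + flow_bwd[yt, xt]
    return inside & (np.linalg.norm(round_trip, axis=-1) < eps)


def flow_consistency_loss(canon_t, canon_t1_at_target, mask):
    """Masked mean distance between canonical coordinates of matched points.

    canon_t: (H, W, 2) canonical coordinates of each pixel of frame t.
    canon_t1_at_target: (H, W, 2) canonical coordinates, evaluated in frame
        t+1, of the point that each pixel flows to.
    """
    dist = np.linalg.norm(canon_t - canon_t1_at_target, axis=-1)
    return float((dist * mask).sum() / max(int(mask.sum()), 1))
```

Pixels that leave the frame or fail the round-trip check contribute nothing, so unreliable flow (for example, near occlusions) does not pull the deformation field in the wrong direction.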

Grouped Content Deformation Fields. Although the representation can learn to reconstruct a video using a single content deformation field, complex motions arising from multiple overlapping objects may lead to conflicts within a single canonical image. Consequently, the boundary region might suffer from inaccurate reconstruction. For challenging instances featuring large occlusions, we propose an option to introduce layers corresponding to multiple content deformation fields. These layers are defined based on semantic segmentation, thereby improving the accuracy and robustness of video reconstruction in these demanding scenarios. We leverage Segment-Anything-track (SAM-track) to segment each video frame $I_{i}$ into $K$ semantic layers with masks $\{M_{0}^{i},...,M_{K-1}^{i}\}$. For each layer, we use a group of canonical fields and deformation fields to represent the separate motions of different objects. These models are subsequently formulated as groups of implicit fields: $\mathcal{D}:\{\mathcal{D}_{1},...,\mathcal{D}_{K}\},\mathcal{C}:\{\mathcal{C}_{1},...,\mathcal{C}_{K}\}$. In theory, for semantic layer $k$ in frame $i$, it is sufficient to sample pixels in the region $M_{k}^{i}$ for efficient reconstruction. However, hash encoding can result in random and unstructured patterns in unsupervised regions, which decreases the performance of image-based models trained on natural images. To tackle this issue, we sample a number of points outside of the region $M_{k}^{i}$ and train them using the $L_{2}$ loss with the ground truth color. In this way, we effectively regularize ${\bar{M}_{k}^{i}}$ with the background loss $\mathcal{L}_{\text{bg}}$. Consequently, the canonical image attains a more natural appearance, leading to enhanced processing results.

Training Objectives. The representation is trained by minimizing the objective function $\mathcal{L}_{\text{rec}}$. This function corresponds to the $L_{2}$ loss between the ground truth color and the predicted color $\mathbf{c}$ for a given coordinate $\mathbf{x}$. To regularize and stabilize the training process, we introduce additional regularization terms as previously discussed. The total loss is calculated using the following equation

$$\mathcal{L}=\mathcal{L}_{\text{rec}}+\lambda_{1}\mathcal{L}_{\text{flow}},$$

where $\lambda_{1}$ is the hyperparameter for the loss weight. Note that when training the grouped deformation field, we include an additional regularizer, denoted as $\lambda_{2}\mathcal{L}_{\text{bg}}$.

Application to Consistent Video Processing

Upon the optimization of the content deformation field, the canonical image $I_{c}$ is retrieved by setting the deformation of all points to zero. It is important to note that the size of the canonical image can be flexibly adjusted to be larger than the original image size depending on the scene movement observed in the video, thereby allowing more content to be included. The canonical image $I_{c}$ is then utilized in executing various downstream algorithms for consistent video processing. We evaluated the following state-of-the-art (SOTA) algorithms: (1) ControlNet: used for prompt-guided video-to-video translation. (2) Segment-anything (SAM): applied for video object tracking. (3) R-ESRGAN: employed for video super-resolution. Additionally, the canonical image allows users to conveniently edit the video by directly modifying the image. We further illustrate this capability through multiple manual video editing examples.
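Propagating a processed canonical image back to the frames amounts to a resampling step, sketched below. The sketch assumes that, for each frame, the learned deformation has already been evaluated into an array `canonical_coords` holding the canonical-image position (in pixels) of every frame pixel; that array, the function names, and the use of plain bilinear sampling are assumptions for illustration.

```python
import numpy as np


def bilinear_sample(image, coords):
    """Sample image (H, W, C) at float pixel coordinates (..., 2) given as (x, y)."""
    h, w = image.shape[:2]
    x = np.clip(coords[..., 0], 0, w - 1)
    y = np.clip(coords[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = (x - x0)[..., None]
    fy = (y - y0)[..., None]
    top = image[y0, x0] * (1 - fx) + image[y0, x1] * fx
    bottom = image[y1, x0] * (1 - fx) + image[y1, x1] * fx
    return top * (1 - fy) + bottom * fy


def propagate(processed_canonical, canonical_coords):
    """Render one output frame from a processed canonical image.

    canonical_coords: (H, W, 2) canonical (x, y) position of every frame pixel.
    """
    return bilinear_sample(processed_canonical, canonical_coords)
```

Because every frame samples the same processed image, an image algorithm only needs to run once on $I_{c}$, and its result is shared by all frames through the deformation.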

Experiments

We conduct experiments to underscore the robustness and versatility of our proposed method. Our representation is robust to a variety of deformations, encompassing rigid and non-rigid objects, as well as complex scenarios such as smog. The default parameters for our experiments are set with the anneal begin and end steps at 4000 and 8000, respectively. The total number of iterations is capped at 10000. On a single NVIDIA A6000 GPU, the average training duration is approximately 5 minutes when utilizing 100 video frames. It should be noted that the training time varies with several factors such as the length of the video, the type of motion, and the number of layers. By adjusting the training parameters accordingly, the optimization duration can be varied from 1 to 10 minutes.

Evaluation

The evaluation of our representation is concentrated on two main aspects: the quality of the reconstructed video with the estimated canonical image, and the quality of downstream video processing. Owing to the lack of accurate evaluation metrics, conducting a precise quantitative analysis remains challenging. Nevertheless, we include a selection of quantitative results for further examination.

Reconstruction Quality. In a comparative analysis with the Neural Image Atlas, our model, as demonstrated in Fig. 3, exhibits superior robustness to non-rigid motion, effectively reconstructing subtle movements with heightened precision (e.g., eye blinking, facial textures). Quantitatively, the video reconstruction PSNR of our algorithm on the collected video datasets is 4.4 dB higher. Compared with the atlas, our canonical image is more natural and thus facilitates the application of established image algorithms. In addition, our method makes significant progress in training efficiency, i.e., 5 minutes (ours) vs. 10 hours (atlas).
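For reference, PSNR can be computed as in the following sketch. The averaging convention (mean of per-frame PSNR) and the assumption that pixel values lie in $[0,1]$ are choices made for this example and are not stated in the text.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


def video_psnr(frames, reconstructed_frames, max_value=1.0):
    """Mean per-frame PSNR (assumed convention) over a whole video."""
    scores = [psnr(f, r, max_value) for f, r in zip(frames, reconstructed_frames)]
    return float(np.mean(scores))
```

Since PSNR is logarithmic in the mean squared error, a gain of 4.4 dB at a fixed peak value corresponds to an MSE that is lower by a factor of roughly $10^{0.44}\approx 2.75$.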

Downstream Video Processing. We provide an expanded range of potential applications associated with the proposed representations, including video-to-video translation, video keypoint tracking, video object tracking, video super-resolution, and user-interactive video editing.

(a) Video-to-video Translation. By applying image translation to the canonical image, we can perform video-to-video translation. A qualitative comparison is presented encompassing several baseline methods that fall into three distinct categories: (1) per-frame inference with image translation models, such as ControlNet; (2) layered video editing, exemplified by Text2Live; and (3) diffusion-based video translation, including Tune-A-Video and FateZero. As depicted in Fig. 4, the per-frame image translation models yield high-fidelity content, accompanied by significant flickering. The alternative baselines exhibit compromised generation quality or comparatively low temporal consistency. The proposed pipeline effectively lifts image translation to video, maintaining the high quality associated with image translation algorithms while ensuring substantial temporal consistency. A thorough comparison is better appreciated by viewing the accompanying videos.

(b) Video Keypoint Tracking. By estimating the deformation field for each individual frame, it is feasible to query the position of a specific keypoint in one frame within the canonical space and subsequently identify the corresponding points present in all frames, as in Fig. 5. We demonstrate tracking points on non-rigid objects such as fluids in the videos on the project page.
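A brute-force version of this query can be sketched as follows. It assumes the deformation field has been evaluated into an array `canonical_coords` of shape `(N, H, W, 2)` that stores the canonical position of every pixel of every frame, and it finds correspondences by nearest-neighbor search in canonical space; the array, the function name, and the search strategy are assumptions for illustration only.

```python
import numpy as np


def track_keypoint(canonical_coords, frame_idx, pixel_xy):
    """Track one keypoint through all frames via the canonical space.

    canonical_coords: (N, H, W, 2) canonical (x, y) position of each pixel.
    pixel_xy: integer (x, y) location of the keypoint in frame frame_idx.
    Returns a list of (x, y) pixel locations, one per frame.
    """
    x, y = pixel_xy
    target = canonical_coords[frame_idx, y, x]
    tracks = []
    for coords in canonical_coords:
        # Pixel whose canonical position is closest to the keypoint's.
        dist = np.linalg.norm(coords - target, axis=-1)
        yy, xx = np.unravel_index(np.argmin(dist), dist.shape)
        tracks.append((int(xx), int(yy)))
    return tracks
```

For content that translates right by one pixel per frame, a keypoint at `(2, 3)` in the first frame is found at `(3, 3)` and `(4, 3)` in the next two frames.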

(c) Video Object Tracking. By applying segmentation algorithms to the canonical image, we can propagate masks throughout the video sequence using the content deformation field. As illustrated in Fig. 6, our pipeline proficiently yields masks that maintain consistency across all frames.

(d) Video Super-resolution. By directly applying the image super-resolution algorithm to the canonical image, we can execute video super-resolution to generate high-quality video as in Fig. 7. Given that the deformation is represented by a continuous field, the application of super-resolution will not result in flickering.

(e) User-interactive Video Editing. Our representation allows for user editing on objects with unique styles without influencing other parts of the image. As exemplified in Fig. 8, users can manually adjust content on the canonical image to perform precise edits in areas where the automatic editing algorithm does not achieve optimal results.

Ablation Study

To validate the effect of the proposed modules, we conducted an ablation study. On substituting the 3D hash encoding with positional encoding, there is a notable decrease in the reconstruction PSNR of the video by 3.1 dB. In the absence of the annealed hash, the canonical image loses its natural appearance, as evidenced by the presence of multiple hands in Fig. 9. Furthermore, without incorporating the flow loss, smooth areas are noticeably affected by pronounced flickering. For a more extensive comparison, please refer to the videos on the project page.

Conclusion and Discussion

In this paper, we have investigated representing videos as content deformation fields, focusing on achieving temporally consistent video processing. Our approach demonstrates promising results in terms of both fidelity and temporal consistency. However, there remain several challenges to be addressed in future work.

One of the primary issues pertains to the per-scene optimization required in our methodology. We anticipate that advancements in feed-forward implicit field techniques could potentially be adapted to this direction. Another challenge arises in scenarios involving extreme changes in viewpoint. To tackle this issue, the incorporation of 3D prior knowledge may prove beneficial, as it can provide additional information and constraints. Lastly, the handling of large non-rigid deformations remains a concern. To address this, one potential solution involves employing multiple canonical images, which can better capture and represent complex deformations.