Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, Wei Liu

Introduction

Video outpainting aims to expand the spatial content of a video beyond its original boundaries to fill a designated canvas region. This task has numerous applications, such as enhancing the viewing experience by adjusting the aspect ratio of videos to match different users’ smartphones.

Recently, diffusion models have emerged as the dominant approach for visual generation, demonstrating exceptional visual synthesis ability by producing appealing results. Meanwhile, several diffusion-based video outpainting methods, such as M3DDM and MOTIA, have been proposed. They utilize the source video as a condition and generate the canvas region through step-by-step denoising, showing great performance. However, their results are limited in terms of resolution, such as $256\times 256$ and $512\times 1024$, or content expansion ratio, for example, from $256\times 85$ to $256\times 256$ ($3\times$) and from $512\times 512$ to $512\times 1024$ ($2\times$). This raises an intriguing question: “Is it possible to outpaint a video to higher resolution with a higher content expansion ratio?”

This question drives us to evaluate the capability of existing methods in tackling this difficult task. However, we find that they fall short due to limitations in GPU memory. To further explore their potential, we reduce the resolution of the source video through resizing and then resize the result back after outpainting (see details in Section 4). The results are depicted in Fig 2. We observe that both M3DDM and MOTIA produce low-quality results, e.g., blurry content and temporal inconsistencies. This motivates us to investigate the underlying reasons. We speculate that two factors contribute: (i) the reduced resolution after resizing negatively affects the performance, and (ii) the content expansion ratio is too high to achieve satisfactory results. We conduct experiments that vary each of these factors (see Fig 3). The results demonstrate that both low resolution and a high content expansion ratio significantly reduce generation quality. In other words, achieving high-quality results requires performing outpainting in the original/high resolution with a low content expansion ratio.

Based on the analysis above, we propose a diffusion-based method called Follow-Your-Canvas for higher-resolution video outpainting with extensive content generation. We identify that the GPU memory limitation arises from the “single-shot” outpainting practice: directly taking the entire video as the input. In contrast, our Follow-Your-Canvas is designed to distribute the task across spatial windows. It kills two birds with one stone. First, it enables us to outpaint any video to higher resolution with a high content expansion ratio, without being constrained by GPU memory. Second, it simplifies the challenging task by breaking it down into smaller and easier sub-tasks: outpainting each window in the original/high resolution with a low content expansion ratio. Specifically, during the training phase, we randomly sample an anchor window and a target window from the source video, mimicking the “source video” and “outpainting region” for inference respectively. This helps the model learn how to flexibly outpaint with different relative positions and overlaps between the source video and the outpainting region. During the inference phase, we outpaint a video by denoising windows that cover the entire video. To accelerate the generation process, we perform window outpainting in parallel on multiple GPUs. After each step of denoising, we seamlessly merge the windows using Gaussian weights to ensure a smooth transition between them. Since a video of any resolution can be covered by a certain number of fixed-size windows, and each window fits within the GPU memory budget, our Follow-Your-Canvas method can be applied to situations where the canvas size is very large.

Despite the advantages offered by the spatial window strategy, we observe conflicts between the layout generated within each window and the overall layout of the source video (see Fig 4). This issue arises because the model input for each window is only a portion of the source video. Consequently, while the outpainting results within each window are reasonable, they fail to align with the overall layout, particularly when the overlap is low. To address this challenge, our Follow-Your-Canvas method incorporates the source video and its relative positional relation into the generation process of each window. This ensures that the generated layout harmonizes with the source video. Specifically, we introduce a Layout Encoder (LE) module, which takes the source video as input and provides overall layout information to the model through cross-attention. Meanwhile, we incorporate a Relative Region Embedding (RRE) into the output of the LE module, which offers information about the relative positional relation. The RRE is calculated based on the offset of the source video to the target window (outpainting region), as well as their sizes. The LE and RRE guide each window to generate outpainting results that conform to the global layout based on its relative position, effectively improving the spatial-temporal consistency.

By coupling the spatial window and layout alignment strategies, our Follow-Your-Canvas excels in large-scale video outpainting. For example, it outpaints videos from $512\times 512$ to $1152\times 2048$ ($9\times$), while delivering high-quality and aesthetically pleasing results (Fig 1). When compared to existing methods, Follow-Your-Canvas produces better results by maintaining spatial-temporal consistency (Fig 2). Follow-Your-Canvas also achieves the best quantitative results across various resolution and scale setups. For example, it improves FVD from $928.6$ to $735.3$ ($+193.3$) when outpainting from $512\times 512$ to $2048\times 1152$ ($9\times$) on the DAVIS 2017 dataset.

Our main contributions are summarized as follows:

We emphasize the importance of high resolution and a low content expansion ratio for video outpainting.

Based on the observation, we distribute the task across spatial windows, which not only overcomes GPU memory limitations but also enhances outpainting quality.

To ensure alignment between the generated layout and the source video, we incorporate the source video and its relative positional relation into the generation process.

Our Follow-Your-Canvas demonstrates great outpainting capabilities through both qualitative and quantitative results.

Related Work

Diffusion models are a class of generative models that progressively convert noise into structured data through a learned denoising process. They have garnered significant attention in visual generation. By applying diffusion models in the latent space, LDM has demonstrated the ability to generate high-quality images with limited computational resources. Meanwhile, many works generate impressive videos by inserting temporal layers into the model structure. This has promoted the rapid development of video generation in editing, controllable generation, outpainting, etc.

Video outpainting seeks to extend the spatial contents of a video beyond its initial boundaries, allowing it to fill a specific canvas region. Although image outpainting has been extensively studied, video outpainting still needs to be fully researched. Recently, some diffusion-based approaches have been introduced. M3DDM presents global frame-guided training with a coarse-to-fine inference pipeline to tackle the artifact accumulation issue. Meanwhile, MOTIA proposes a test sample-specific fine-tuning strategy to learn the patterns of each sample. Despite their great results, they are limited in terms of resolution, such as $256\times 256$ and $512\times 1024$, or content expansion ratio, such as $2\times$ and $3\times$. As these two factors are the core of outpainting, this paper makes the first attempt to study video outpainting with high resolution, e.g., $1152\times 2048$, and a high content expansion ratio, e.g., $9\times$.

Method

We present Follow-Your-Canvas, a diffusion-based method, which enables higher-resolution video outpainting with extensive content generation. Our approach is built upon two key designs. First, we employ spatial windows to divide the outpainting task into smaller and easier sub-tasks. Second, we introduce a layout encoder module as well as a relative region embedding to align the generated spatial layout.

To address the GPU memory limitations, we distribute the outpainting task across spatial windows. It allows us to outpaint any videos to higher resolution with a high content expansion ratio without being constrained by GPU memory. Moreover, it simplifies the task by breaking it down into smaller and easier sub-tasks: outpainting each window in its original/high resolution with a low content expansion ratio.

Training phase. Fig 5 illustrates the training phase of Follow-Your-Canvas. Given each training video sample, we randomly crop an anchor window and a target window. They serve as the “source video” and the “region to perform outpainting”, mimicking the source video and the outpainting windows during inference, respectively. The conventional training practice of the latent diffusion model adds noise to the latent representation of the data (the target window) to build the model input and makes the model predict the noise. Here, we concatenate it with conditions: the latent representation of a masked target window and the binary mask. They offer information about the original video and its position. Since the mask has 1 channel and each latent representation output by the VAE encoder has 4 channels, the final model input has 4 + 4 + 1 = 9 channels. We modify the first convolution layer of the denoising UNet to adjust to the channel changes, similar to previous works. However, instead of employing a fixed region for outpainting, we use a random sample of the anchor window and the target window. This helps the model learn to flexibly outpaint with different relative positions and overlaps between the source video and the outpainting region, enabling the sliding window-based inference phase described next. Note that the size of the anchor window, the size of the target window, and their overlap are all variables. See details in experiments.
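As a rough illustration of the window sampling and input layout described above, the following Python sketch samples two windows, builds the binary mask of the target window, and counts the input channels. The helper names (`sample_window`, `target_mask`, `unet_input_channels`) and the size bounds are hypothetical and not taken from the paper; only the channel counts (4 per latent, 1 for the mask) come from the text. The mask convention (1 marks the region to generate) is an assumption, and for simplicity the sketch works at a single resolution rather than resizing the mask to the latent grid.

```python
import random

LATENT_CHANNELS = 4  # channels of a VAE latent, as stated in the text
MASK_CHANNELS = 1    # channels of the binary mask, as stated in the text


def sample_window(video_h, video_w, min_size, max_size, rng):
    """Sample a random window (top, left, height, width) inside the video.

    The size bounds are illustrative; the paper leaves them to its appendix.
    """
    h = rng.randint(min_size, min(max_size, video_h))
    w = rng.randint(min_size, min(max_size, video_w))
    top = rng.randint(0, video_h - h)
    left = rng.randint(0, video_w - w)
    return top, left, h, w


def target_mask(anchor, target):
    """Binary mask over the target window.

    0 marks pixels covered by the anchor window (known content),
    1 marks pixels the model has to generate (assumed convention).
    """
    a_top, a_left, a_h, a_w = anchor
    t_top, t_left, t_h, t_w = target
    mask = []
    for y in range(t_top, t_top + t_h):
        row = []
        for x in range(t_left, t_left + t_w):
            known = a_top <= y < a_top + a_h and a_left <= x < a_left + a_w
            row.append(0 if known else 1)
        mask.append(row)
    return mask


def unet_input_channels():
    """Noisy target latent + masked target latent + mask."""
    return LATENT_CHANNELS + LATENT_CHANNELS + MASK_CHANNELS


if __name__ == "__main__":
    rng = random.Random(0)
    anchor = sample_window(64, 64, 16, 32, rng)
    target = sample_window(64, 64, 16, 32, rng)
    mask = target_mask(anchor, target)
    unknown = sum(sum(row) for row in mask)
    print("anchor:", anchor, "target:", target)
    print("pixels to generate:", unknown, "input channels:", unet_input_channels())
```

Because both windows are drawn independently, the overlap between them varies from none to full, which is the variability the training phase relies on.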

Inference phase. Fig 6 illustrates the inference phase of Follow-Your-Canvas. Given a source video to be outpainted, our Follow-Your-Canvas first determines the number (denoted as $N$) of spatial windows and their positions, which should cover the source video and fill the target region to be outpainted (find more details in experiments). During each denoising step $t$, Follow-Your-Canvas performs outpainting within each window $k$ on noisy data $\mathbf{x}_{t}^{k}$, where $k\in\{1,\dots,N\}$. Here, the source video and the window correspond to the anchor window and the target window of the training phase respectively. The denoised outputs in the $N$ windows, i.e., $\{\mathbf{x}_{t-1}^{k}\}_{k=1}^{N}$, are then merged via Gaussian weights to get a smooth outcome $\mathbf{x}_{t-1}$. The process is repeated until the final outpainting result $\mathbf{x}_{0}$ is obtained. Importantly, the inference process of each window is independent of the others, allowing us to perform outpainting within each window in parallel on separate GPUs, thereby accelerating the inference. We analyze its efficiency in experiments.
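The merging step can be illustrated with a minimal Python sketch. It blends per-window outputs into one canvas with a weighted average, where each window contributes most near its own center. The exact weight function is not given in this section, so the Gaussian shape, the `sigma_frac` parameter, and the function names are assumptions made for illustration; a single 2D channel stands in for the full latent video tensor.

```python
import math


def gaussian_weights(height, width, sigma_frac=0.25):
    """2D Gaussian weights that peak at the window center.

    sigma_frac (standard deviation as a fraction of the window size)
    is a hypothetical parameter, not a value from the paper.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = height * sigma_frac, width * sigma_frac
    return [
        [
            math.exp(-0.5 * (((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def merge_windows(canvas_h, canvas_w, windows):
    """Blend window outputs into one canvas by a weighted average.

    windows: list of (top, left, values), values being a 2D list.
    Positions covered by no window are left at 0.0.
    """
    acc = [[0.0] * canvas_w for _ in range(canvas_h)]
    norm = [[0.0] * canvas_w for _ in range(canvas_h)]
    for top, left, values in windows:
        h, w = len(values), len(values[0])
        weights = gaussian_weights(h, w)
        for y in range(h):
            for x in range(w):
                acc[top + y][left + x] += weights[y][x] * values[y][x]
                norm[top + y][left + x] += weights[y][x]
    return [
        [acc[y][x] / norm[y][x] if norm[y][x] > 0 else 0.0 for x in range(canvas_w)]
        for y in range(canvas_h)
    ]


if __name__ == "__main__":
    left_window = [[1.0] * 4 for _ in range(4)]
    right_window = [[3.0] * 4 for _ in range(4)]
    merged = merge_windows(4, 6, [(0, 0, left_window), (0, 2, right_window)])
    print([round(v, 3) for v in merged[0]])
```

In the overlap, the value moves gradually from one window's output to the other's instead of switching abruptly at a seam, which is the purpose of the weighted merge.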

Layout Alignment

Despite the advantages offered by the spatial window strategy, we observe conflicts between the layout generated within each window and the overall layout of the source video, as shown in Fig 4. The outpainting results within each window of the “baseline”, which only applies the spatial window strategy, are reasonable. However, they do not align with the global layout because each window is provided with a view of only a part of the source video. To enable spatial and temporal consistency, we introduce a layout encoder and relative region embedding. They deliver the layout information of the source video and its relative position relation to each window respectively, effectively helping the model generate more stable and consistent outpainting videos (see the results of the “+LE & RRE” method in Fig 4).

Layout Encoder (LE). Similar to the text encoder that injects the text prompts into the model, we introduce LE to incorporate layout information from the source video (see Fig 5). Specifically, LE consists of a SAM encoder, a layout extraction module, and a Q-former. Instead of employing the CLIP visual encoder like many previous works, we find that the SAM encoder (ViT-B/16 structure) extracts visual features more effectively by providing finer visual details (see comparisons in experiments). Then, the layout features are extracted by the layout extraction module, which includes a pseudo-3D convolution layer, two temporal attention layers, and a temporal pooling layer. Inspired by [16], we employ a Q-former (Querying Transformer) to extract and refine visual representations of the layout information through learnable query tokens. We train the layout extraction module and the Q-former while keeping the SAM encoder frozen. The relative region embedding, introduced next, is added to the output of the LE to provide the positional relation between the anchor window and the target window.

Relative Region Embedding (RRE). RRE provides the positional relation between the anchor window and the target window (see Fig 5). We denote the height, width, and center point coordinates of the anchor window as $H_{\text{anchor}}$, $W_{\text{anchor}}$, and $(X_{\text{anchor}},Y_{\text{anchor}})$ respectively. The target window is defined in the same way. RRE employs sinusoidal position encoding to embed the size and relative position relation between the anchor and target windows, i.e., $\{H_{\text{anchor}},W_{\text{anchor}},H_{\text{target}},W_{\text{target}},H_{\text{offset}},W_{\text{offset}}\}$, where $H_{\text{offset}}=Y_{\text{target}}-Y_{\text{anchor}}$ and $W_{\text{offset}}=X_{\text{target}}-X_{\text{anchor}}$. The embeddings are then fed to a fully-connected (FC) layer. The output of the FC layer is repeated to match the output of the LE. We incorporate the LE and RRE using a cross-attention layer inserted in each spatial-attention block of the model. Due to the limitation of paper length, we leave more details about the design of the model structure to the appendix.
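The six quantities and their sinusoidal encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the embedding dimension, the `max_period` constant, and the function names are hypothetical, and the FC layer and the repetition to match the LE output are omitted.

```python
import math


def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Standard sinusoidal encoding of one scalar into `dim` values.

    dim and max_period are illustrative choices, not values from the paper.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def relative_region_features(anchor, target):
    """Return [H_anchor, W_anchor, H_target, W_target, H_offset, W_offset].

    anchor and target are (height, width, center_x, center_y) tuples.
    """
    h_a, w_a, x_a, y_a = anchor
    h_t, w_t, x_t, y_t = target
    h_offset = y_t - y_a  # vertical offset between the two centers
    w_offset = x_t - x_a  # horizontal offset between the two centers
    return [h_a, w_a, h_t, w_t, h_offset, w_offset]


def relative_region_embedding(anchor, target, dim=8):
    """Concatenate the sinusoidal encodings of the six features."""
    embedding = []
    for value in relative_region_features(anchor, target):
        embedding.extend(sinusoidal_embedding(value, dim))
    return embedding


if __name__ == "__main__":
    anchor = (512, 512, 1024, 576)  # source video centered on the canvas
    target = (512, 512, 768, 576)   # window shifted 256 pixels to the left
    print(relative_region_features(anchor, target))
    print(len(relative_region_embedding(anchor, target)))
```

Encoding the offset with its sign lets the model distinguish a window to the left of the source video from one to the right, even when both overlap the source by the same amount.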

Experiments

Dataset. M3DDM uses a private dataset with $\sim$5M video samples. Here, we employ a random subset ($\sim$1M video samples) of the public Panda-70M dataset for training, improving the reproducibility of our work.

Implementation details. Our implementation and model initialization are based on the popular video generation framework AnimateDiff-V2. Due to the limitation of paper length, we leave more specific details about the training recipe, the design of the anchor and target windows, and the inference pipeline to the appendix.

Evaluation metrics. We first employ the metrics PSNR, SSIM, LPIPS, and FVD, following [32]. To evaluate high-resolution video generation, we further utilize aesthetic quality (AQ) and imaging quality (IQ), assessing the layout/color harmony and visual distortion (e.g., noise and blur) respectively.

Baselines. We compare our Follow-Your-Canvas with the following baseline methods. (1) The method of [6] uses flow estimation and background prediction. (2) M3DDM employs global-frame features to achieve global and long-range information transfer. (3) MOTIA trains a LoRA to learn patterns of test samples. We reproduce these baseline methods using their official code for high-resolution video outpainting and directly cite their reported results for the low-resolution setting.

Comparisons to Baseline Methods

We compare methods in both high and low-resolution settings. (1) High-resolution with large content expansion ratios. Table 1 shows the results. Our Follow-Your-Canvas consistently achieves the best performance for all metrics and outpainting settings. Meanwhile, as the resolution and content expansion ratio increase, the performance improvement on many metrics becomes more significant. For example, Follow-Your-Canvas improves FVD from 473.7 to 440.0 (+33.7) in 720P ($\sim 3.5\times$), from 575.9 to 486.1 (+89.8) in 1.5K, and from 928.6 to 735.3 (+193.3) in 2K. Our Follow-Your-Canvas effectively improves performance in the challenging task of high-resolution outpainting with high content expansion ratios. (2) Conventional settings in low-resolution. Following [7] and [32], we also compare results in low-resolution, outpainting videos to $256\times 256$ in the horizontal direction using mask ratios of $0.25$ ($\sim 1.3\times$) and $0.66$ ($\sim 3\times$) and calculating the average performance. Table 2 shows the results. Our Follow-Your-Canvas still achieves excellent performance under this conventional setting. Note that MOTIA fine-tunes the model for each test sample, which may not be efficient, while our Follow-Your-Canvas method performs zero-shot inference after model training.
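The content expansion ratios quoted above follow from simple area arithmetic, which the short Python check below reproduces. Two points are assumptions rather than statements from the text: that 720P denotes $1280\times 720$, and that a horizontal mask ratio $r$ corresponds to an expansion of $1/(1-r)$. The function names are hypothetical.

```python
def expansion_ratio(source_hw, target_hw):
    """Ratio of target area to source area for (height, width) pairs."""
    return (target_hw[0] * target_hw[1]) / (source_hw[0] * source_hw[1])


def ratio_from_mask(mask_ratio):
    """Expansion ratio when a fraction `mask_ratio` of the output is generated.

    Assumes the known region occupies the remaining (1 - mask_ratio) share.
    """
    return 1.0 / (1.0 - mask_ratio)


if __name__ == "__main__":
    source = (512, 512)
    print(expansion_ratio(source, (720, 1280)))   # 720P, assumed 1280x720
    print(expansion_ratio(source, (1152, 2048)))  # 2K setting from the text
    print(ratio_from_mask(0.25), ratio_from_mask(0.66))
```

Under these assumptions the 720P setting gives about $3.5\times$, the 2K setting gives exactly $9\times$, and the two mask ratios give about $1.3\times$ and $3\times$, matching the figures in the text.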

Qualitative Results

In Fig. 7, we showcase the qualitative results. M3DDM fails to generate meaningful content in the majority of outpainting regions. MOTIA, on the other hand, has difficulty maintaining spatial and temporal consistency, which can be attributed to the challenge of handling high resolution and high content expansion ratios. In contrast, our Follow-Your-Canvas successfully generates well-structured visual content. This is because the spatial window design outpaints each window in its original/high resolution with a low content expansion ratio. Moreover, the layout alignment plays a crucial role in guiding the overall layout of the outpainting results.

Ablation Study

We conduct the ablation study by outpainting the source video from $512\times 512$ to $1440\times 810$, as shown in Table 3. We find that the relative region embedding (RRE), the layout encoder (LE), and the layout extraction module are all important to achieve the best results. Compared to the popular CLIP encoder, we observe that the SAM encoder helps the model further improve outpainting results. Visual results are shown in Fig 8.

Conclusion

Largely expanding an image/video is the core of the outpainting task. In this study, we take the first step towards exploring higher-resolution video outpainting with high content expansion ratios. We achieve this by introducing the spatial window strategy combined with the design of layout alignment. Our Follow-Your-Canvas method allows for large-scale video outpainting, e.g., from $512\times 512$ to $1152\times 2048$ ($9\times$). We hope our work can pave the way for further progress in this promising direction and push this frontier.

Limitations. Although Follow-Your-Canvas has achieved great outpainting performance, it may have a longer inference time due to the spatial window strategy, as shown in Table 4. To reduce time consumption, we suggest that users utilize multiple GPUs in parallel. Besides, we encourage further research to investigate techniques for improving inference speed.

References

More Implementation details

The quantitative metric evaluation of our method is based on the DAVIS dataset. The DAVIS (Densely Annotated VIdeo Segmentation) dataset is pivotal for video object segmentation research. Following [7] and [32], we use the DAVIS 2017 TrainVal subset, which contains $90$ videos for evaluating the outpainting performance. For the task of high-resolution video outpainting, we use the DAVIS 2017 dataset with full resolution, which has an average resolution of $1338\times 2400$. For the task of low-resolution video outpainting, we use the 480p version of the DAVIS dataset following [7].

We employ popular metrics including Peak Signal to Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Frechet Video Distance (FVD), similar to previous works. We further include the metrics of aesthetic quality (AQ) and imaging quality (IQ) from VBench for video generation quality evaluation (without ground-truth). Specifically, AQ evaluates the layout/color richness and harmony, while IQ assesses visual distortion such as noise and blur.
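Of these metrics, PSNR has a simple closed form, $\text{PSNR}=10\log_{10}(\text{MAX}^{2}/\text{MSE})$, which the Python sketch below computes for flat lists of pixel values. It is a generic reference computation, not the evaluation code used in the paper; the function name and the 8-bit peak value of 255 are assumptions.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak Signal to Noise Ratio between two equally sized pixel lists.

    max_value is the peak pixel value (255 for 8-bit images, assumed here).
    Returns infinity when the two inputs are identical.
    """
    if not reference or len(reference) != len(generated):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([10, 20, 30, 40], [12, 18, 33, 41]))
```

Higher PSNR means the outpainted frame is closer to the ground truth, which is why it can only be reported in settings where a ground-truth frame exists.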

Baseline Methods

We reproduce the baseline methods using their official code for high-resolution video outpainting and directly cite their reported results for the low-resolution setting. Specifically, since M3DDM only supports 256-resolution outpainting, we resize the source video to perform outpainting, and resize the outpainted video to the target resolution by bilinear interpolation. We run the other methods in the same way if they are constrained by GPU memory. Although this protocol does not yield a fully fair comparison, our Follow-Your-Canvas achieves the best results on both the high-resolution and the low-resolution tasks.

Training of Follow-Your-Canvas

Inference of Follow-Your-Canvas

After training the model using the spatial window strategy, we can outpaint a video from any resolution to any target resolution by dividing the outpainting area into multiple windows and blending the denoising results. Specifically, we partition the outpainting region into spatial windows and perform outpainting in multiple rounds, as shown in Figure 9. In the first round, the source video acts as the “anchor window”, while subsequent rounds utilize the outpainting results from the previous round as the anchor window. This process is repeated until the designated canvas is filled. See the inference pipeline of Follow-Your-Canvas in Algorithm 1.
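To make the idea of covering a canvas with fixed-size windows concrete, the Python sketch below computes a uniform grid of overlapping windows. It is not the paper's Algorithm 1 and does not model the multi-round scheme; the window size of $512\times 512$, the minimum overlap of 64 pixels, and the function names are hypothetical choices for illustration only.

```python
import math


def window_starts(canvas_len, window_len, min_overlap):
    """Start offsets of windows along one axis.

    Uses the fewest windows whose neighbors overlap by at least
    `min_overlap`, spaced evenly so the last window ends at the edge.
    """
    if not 0 <= min_overlap < window_len:
        raise ValueError("min_overlap must be in [0, window_len)")
    if window_len >= canvas_len:
        return [0]
    stride = window_len - min_overlap
    count = math.ceil((canvas_len - window_len) / stride) + 1
    span = canvas_len - window_len
    return [round(i * span / (count - 1)) for i in range(count)]


def plan_windows(canvas_h, canvas_w, window_h, window_w, min_overlap):
    """Return the (top, left) corner of every window on the canvas."""
    tops = window_starts(canvas_h, window_h, min_overlap)
    lefts = window_starts(canvas_w, window_w, min_overlap)
    return [(top, left) for top in tops for left in lefts]


if __name__ == "__main__":
    corners = plan_windows(1152, 2048, 512, 512, 64)
    print(len(corners), "windows")
    print(corners[:5])
```

Since each window has a fixed size, the memory needed per window stays constant while the number of windows grows with the canvas, which is the property that lets the canvas size scale.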

Preliminaries

Diffusion models consist of two processes: a diffusion/forward process that gradually adds Gaussian noise to the clean data using a fixed Markov chain with $T$ steps, and a denoising/reverse process where the trained model generates samples from Gaussian noise. Building upon the diffusion model, the latent diffusion model (LDM) performs both the diffusion and denoising processes in a latent space to achieve efficient learning. Specifically, LDM encodes the raw pixels $\mathbf{x}$ into a latent space using a VAE encoder $\varepsilon$, that is, $\mathbf{z}=\varepsilon(\mathbf{x})$. Meanwhile, the original pixels $\mathbf{x}$ can be approximately reconstructed from the latent representation $\mathbf{z}$ using a VAE decoder $\mathcal{D}$, that is, $\mathcal{D}(\mathbf{z})\approx\mathbf{x}$.

In this work, we build our Follow-Your-Canvas model upon the video latent diffusion model for video generation. It inflates the 2D layers of LDM into pseudo-3D layers, incorporating temporal information. It also introduces a temporal motion module to each spatial module in LDM, enabling the model to generate smooth and stable videos. In the latent space, a UNet $\varepsilon_{\theta}$ estimates the added noise guided by the objective:

$$\mathcal{L}=\mathbb{E}_{z_{0},\,C,\,\epsilon\sim\mathcal{N}(0,\mathbf{I}),\,t}\left[\left\|\epsilon-\varepsilon_{\theta}(z_{t},t,C)\right\|_{2}^{2}\right],$$

where $C$ is the condition and $z_{t}$ is a noisy sample of $z_{0}$ at timestep $t$. During inference, given input noise $z_{T}$ sampled from a Gaussian distribution, the network $\varepsilon_{\theta}$ denoises $z_{t}$ step-by-step and decodes the final latent representation by $\mathcal{D}$.
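The forward process and the noise-prediction target can be sketched in plain Python. The sketch uses the standard closed form $z_{t}=\sqrt{\bar{\alpha}_{t}}\,z_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$ with a linear noise schedule; the schedule endpoints, the number of steps, and the function names are assumptions chosen for illustration and are not the paper's settings. A flat list of floats stands in for the latent tensor.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear schedule.

    The schedule values are common defaults, assumed here for illustration.
    """
    if not 0 <= t <= num_steps:
        raise ValueError("t must be in [0, num_steps]")
    product = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product


def add_noise(z0, t, rng):
    """Sample z_t from z_0 in one step; also return the noise that was added."""
    a = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)


if __name__ == "__main__":
    rng = random.Random(0)
    z0 = [0.5, -1.0, 2.0]
    zt, eps = add_noise(z0, 500, rng)
    # A real model would predict eps from (zt, t, C); zeros stand in here.
    print(zt, noise_prediction_loss([0.0] * len(eps), eps))
```

At $t=0$ the sample equals the clean latent, and as $t$ approaches the final step almost no signal remains, so the model must learn to recover the added noise across the full range of noise levels.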

Diffusion-based Video Outpainting

Video outpainting aims to generate the surrounding regions of a given source video, which can be considered a conditional video generation task. Its key objective is to make the generated video not only exhibit a well-structured spatial layout but also preserve temporal consistency. Following [7, 32], we denote the original pixels as $\mathbf{x}$, a 0-1 binary mask as $\mathbf{m}$, the known region as $\mathbf{x}^{\text{known}}=(1-\mathbf{m})\odot\mathbf{x}$, and the unknown region as $\mathbf{x}^{\text{unknown}}=\mathbf{m}\odot\mathbf{x}$, where $\odot$ represents the Hadamard product. We concatenate the noisy latent representation of the source video, i.e., $\mathbf{z}_{T}$, with its context as a condition, including the latent representation of the masked video $\mathbf{z}^{\text{known}}_{0}$ and the mask $\mathbf{m}$ after resizing. Model parameters $\theta$ are trained by

$$\min_{\theta}\ \mathbb{E}_{\mathbf{z}_{0},\,\epsilon\sim\mathcal{N}(0,\mathbf{I}),\,t}\left[\left\|\epsilon-\varepsilon_{\theta}(\mathbf{z}_{t},t,C)\right\|_{2}^{2}\right],$$

where the condition is $C=\left\{\mathbf{z}^{\text{known}},\mathbf{m},e_{\text{text}}\right\}$, and $e_{\text{text}}$ represents the text embedding extracted from a text prompt.

Additional Results

We further conduct a user study comparing our method with MOTIA and M3DDM. We use the DAVIS dataset to outpaint the source video from $512\times 512$ to $1440\times 810$ resolution. We collect preferences from 30 volunteers, who evaluate 50 randomly selected sets of results based on visual quality (including clarity, color fidelity, and texture detail), realism (whether the overall outpainted scene is harmonious), spatial consistency, and temporal consistency. As shown in Fig. 10, the results from our Follow-Your-Canvas method are overwhelmingly preferred over those of the baseline methods.

Prompt-Following Results

Since our Follow-Your-Canvas is based on AnimateDiff with a text encoder, it naturally supports controlling the generated content using text prompts. We provide three different prompts for outpainting a source video, as shown in Fig. 12. The results show that Follow-Your-Canvas enables one to control the outpainted content using different text prompts.