Generating Long Videos of Dynamic Scenes

Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A. Efros, Tero Karras

cs.CV cs.AI cs.LG cs.NE

Introduction

Videos are data that change over time, with complex patterns of camera viewpoint, motion, deformation and occlusion. In certain respects, videos are unbounded — they may last arbitrarily long and there is no limit to the amount of new content that may become visible over time. Yet videos that depict the real world must also remain consistent with physical laws that dictate which changes over time are feasible. For example, the camera may only move through 3D space along a smooth path, objects cannot morph between each other, and time cannot go backward. Generating realistic long videos thus requires the ability to produce endless new content while simultaneously incorporating the appropriate consistencies.

In this work, we focus on generating long videos with rich dynamics and new content that arises over time. While existing video generation models can produce “infinite” videos, the type and amount of change along the time axis is highly limited. For example, a synthesized infinite video of a person talking will only include small motions of the mouth and head. Moreover, common video generation datasets often contain short clips with little new content over time, which may inadvertently bias the design choices toward training on short segments or pairs of frames, forcing content in videos to stay fixed, or using architectures with small temporal receptive fields.

We make the time axis a first-class citizen for video generation. To this end, we introduce two new datasets that contain motion, changing camera viewpoints, and entrances/exits of objects and scenery over time. We learn long-term consistencies by training on long videos and design a temporal latent representation that enables modeling complex temporal changes. Figure 1 illustrates the rich motion and scenery changes that our model is capable of generating. See our webpage (https://www.timothybrooks.com/tech/long-videos) for video results.

Our main contribution is a hierarchical generator architecture that employs a vast temporal receptive field and a novel temporal embedding. We employ a multi-resolution strategy, where we first generate videos at low resolution and then refine them using a separate super-resolution network. Naively training on long videos at high spatial resolution is prohibitively expensive, but we find that the main aspects of a video persist at a low spatial resolution. This observation allows us to train with long videos at low resolution and short videos at high resolution, enabling us to prioritize the time axis and ensure that long-term changes are accurately portrayed. The low-resolution and super-resolution networks are trained independently with an RGB bottleneck in between. This modular design allows iterating on each network independently and leveraging the same super-resolution network for different low-resolution network ablations.

We compare our results to several recent video generative models and demonstrate state-of-the-art performance in producing long videos with realistic motion and changes in content. Code, new datasets, and pre-trained models on these datasets will be made available.

Prior work

Video generation is a challenging problem with a long history. The classic early works, Video Textures and Dynamic Textures, model videos as textures by analogy with image textures. That is, they explicitly assume the content to be stationary over time, e.g., fire burning, smoke rising, foliage falling, pendulum swinging, etc., and use non-parametric or parametric approaches to model that stationary distribution. Although subsequent video synthesis works have dropped the “texture” moniker, many of the limitations remain: training videos are short, and the models produce few or no new objects entering the frame during the video. Below we summarize some of the more recent efforts on video generation.

Many video generation works are based on GANs, including early models that output fixed-length videos and approaches that use recurrent networks to produce a sequence of latent codes used to generate frames. MoCoGAN explicitly disentangles “motion” from “content” and keeps the latter fixed over the entire generated video. StyleGAN-V is a recent state-of-the-art model we use as a primary baseline. Similar to MoCoGAN, StyleGAN-V employs a global latent code that controls content of an entire video. MoCoGAN-HD, which we also compare with, and StyleVideoGAN attempt to generate videos by navigating the latent space of a pretrained StyleGAN2 model, but struggle to produce realistic motion. Unlike previous StyleGAN-based video models, we prioritize the time axis in our generator through a new temporal latent representation, temporal upsampling, and spatiotemporal modulated convolutions. We also compare with DIGAN that employs an implicit representation to generate the video pixel by pixel.

Transformers are another class of models used for video generation. We compare with TATS that generates long unconditional videos with transformers, improving upon VideoGPT. Both TATS and VideoGPT employ a GPT-like autoregressive transformer that represents videos as sequences of tokens. However, the resulting videos tend to accumulate error over time and often diverge or change too rapidly. The models are also expensive to train and deploy due to their autoregressive nature over time and space. In concurrent work, promising results in generating diverse videos have also been demonstrated using diffusion-based models.

Conditional video prediction.

A separate line of research focuses on predicting future video frames conditioned on one or more real video frames or past frames accompanied by an action label . Some video prediction methods focus specifically on generating infinite scenery by conditioning on camera trajectory and/or explicitly predicting depth to then simulate a virtual camera flying through a 3D scene. Our goal, on the other hand, is to support camera movement as well as moving objects by having the scene structure emerge implicitly.

Multi-resolution training.

Training at multiple scales is a common strategy for image generation models , and transformer-based video generators also employ a related two-phase setup . Acharya et al. propose a multi-scale GAN for video generation that increases both spatial resolution and sequence length during training to produce a fixed-length video. In contrast, our multi-resolution approach is explicitly designed to enable generating arbitrarily long videos with rich long-term dynamics by utilizing the ability to train with long sequences at low resolution.

Our method

Modeling the long-term temporal behavior observed in real videos presents us with two main challenges. First, we must use long enough sequences during training to capture the relevant effects; using, e.g., pairs of consecutive frames fails to provide meaningful training signal for effects that occur over several seconds. Second, we must ensure that the networks themselves are capable of operating over long time scales; if, e.g., the receptive field of the generator spans only 8 adjacent frames, any two frames taken more than 8 frames apart will necessarily be uncorrelated with each other.

Figure 2a shows the overall design of our generator. We seed the generation process with a variable-length stream of temporal noise, consisting of 8 scalar components per frame drawn i.i.d. from a Gaussian distribution. The temporal noise is first processed by a low-resolution generator to obtain a sequence of RGB frames at $64^{2}$ resolution that are then refined by a separate super-resolution network to produce the final frames at $256^{2}$ resolution. We handle datasets with non-square aspect ratios by shrinking all intermediate data accordingly. With 256 $\times$ 144 target resolution, for example, the low-resolution frames will have 64 $\times$ 36 resolution. The role of the low-resolution generator is to model major aspects of the motion and scene composition, which necessitates strong expressive power and a large receptive field over time, whereas the super-resolution network is responsible for the more fine-grained task of hallucinating the remaining details.
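As a rough illustration of the quantities in this paragraph, the following sketch draws a temporal noise stream and derives the low-resolution frame size from a target resolution. The function names, the choice of zero mean and unit variance, and the divisibility check are assumptions made for this example and are not taken from the released code.

```python
import random

def make_temporal_noise(num_frames, components=8, seed=None):
    """Return one row of i.i.d. Gaussian noise per frame (8 scalars by default).
    Zero mean and unit variance are assumed here."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(components)]
            for _ in range(num_frames)]

def low_res_shape(target_width, target_height, factor=4):
    """Shrink the target resolution by the same factor along both axes."""
    if target_width % factor or target_height % factor:
        raise ValueError("target resolution must be divisible by the factor")
    return target_width // factor, target_height // factor

noise = make_temporal_noise(num_frames=128, seed=0)
print(len(noise), len(noise[0]))  # 128 8
print(low_res_shape(256, 256))    # (64, 64)
print(low_res_shape(256, 144))    # (64, 36)
```

Because the noise is simply a per-frame stream, a longer video corresponds to a longer list; no other input changes.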

Our two-stage design provides maximum flexibility in terms of generating long videos. Specifically, the low-resolution generator is designed to be fully convolutional over time, so the duration and time offset of the generated video can be controlled by reshaping and shifting the temporal noise, respectively. The super-resolution network, on the other hand, operates on a frame-by-frame basis. It receives a short sequence of 9 consecutive low-resolution frames and outputs a single high-resolution frame; each output frame is processed independently using a sliding window. The combination of fully-convolutional and per-frame processing enables us to generate arbitrary frames in arbitrary order, which is highly desirable for, e.g., interactive editing and real-time playback.
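The sliding window used by the super-resolution network can be sketched as an index computation. The helper below is hypothetical, and clamping out-of-range indices at the clip boundaries is an assumption of this example; the text does not state how the first and last frames are handled.

```python
def context_window(t, num_frames, radius=4):
    """Indices of the 2 * radius + 1 low-resolution frames used for output frame t.
    Out-of-range indices are clamped to the clip boundaries (an assumption)."""
    if not 0 <= t < num_frames:
        raise IndexError("frame index out of range")
    return [min(max(i, 0), num_frames - 1)
            for i in range(t - radius, t + radius + 1)]

print(context_window(10, 128))  # [6, 7, 8, 9, 10, 11, 12, 13, 14]
print(context_window(0, 128))   # [0, 0, 0, 0, 0, 1, 2, 3, 4]
```

Since each output frame depends only on its own window, frames can be produced in any order, which is what makes random access during playback possible.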

The low-resolution and super-resolution networks are modular with an RGB bottleneck in between. This greatly simplifies experimentation, since the networks are trained independently and can be used in different combinations during inference. We will first describe the training and architecture of the low-resolution generator in Section 3.1 and then discuss the super-resolution network in Section 3.2.

Figure 2b shows our training setup for the low-resolution generator. In each iteration, we provide the generator with a fresh set of temporal noise to produce sequences of 128 frames (4.3 seconds at 30 fps). To train the discriminator, we sample corresponding sequences from the training data by choosing a random video and a random interval of 128 frames within that video.
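The real-data sampling step can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and choosing videos uniformly and skipping videos shorter than the sequence length are assumptions of this example.

```python
import random

def sample_training_clip(video_lengths, seq_len=128, rng=random):
    """Pick a random video, then a random seq_len-frame interval inside it.
    Returns (video_index, start_frame, end_frame), with end_frame exclusive."""
    eligible = [i for i, n in enumerate(video_lengths) if n >= seq_len]
    if not eligible:
        raise ValueError("no video is long enough")
    video = rng.choice(eligible)
    start = rng.randint(0, video_lengths[video] - seq_len)  # inclusive bounds
    return video, start, start + seq_len

lengths = [330, 6504, 90]  # made-up video lengths in frames
rng = random.Random(0)
video, start, end = sample_training_clip(lengths, rng=rng)
print(video, start, end)
print((end - start) / 30.0)  # clip duration in seconds at 30 fps, about 4.27
```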

We have observed that training with long sequences tends to exacerbate the issue of overfitting. As the sequence length increases, we suspect that it becomes harder for the generator to simultaneously model temporal dynamics at multiple time scales, but at the same time, easier for the discriminator to spot any mistakes. In practice, we have found strong discriminator augmentation to be necessary in order to stabilize the training. We employ DiffAug using the same transformation for each frame in a sequence, as well as fractional time stretching between $\frac{1}{2}\times$ and $2\times$; see Appendix C.1 for details.
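The two augmentation ideas in this paragraph (one set of parameters shared by all frames of a clip, and a fractional time stretch) can be illustrated with a toy sketch. Everything specific here is an assumption for illustration: the cyclic shift merely stands in for a DiffAug-style translation, the shift range is arbitrary, the stretch factor is sampled log-uniformly, and how frames are resampled at fractional positions is left out, since those details are in Appendix C.1.

```python
import random

def sample_clip_augmentation(rng):
    """Sample ONE set of parameters that is shared by every frame of a clip."""
    return {
        "shift_x": rng.randint(-8, 8),         # hypothetical translation range
        "stretch": 0.5 * 4.0 ** rng.random(),  # log-uniform in [1/2, 2)
    }

def shift_frame(frame, shift_x):
    """Cyclically shift every row of a 2D frame by shift_x pixels."""
    out = []
    for row in frame:
        k = shift_x % len(row)
        out.append(row[-k:] + row[:-k] if k else list(row))
    return out

def stretched_positions(num_frames, stretch, start=0.0):
    """Fractional source-frame positions after time stretching."""
    return [start + i * stretch for i in range(num_frames)]

def augment_clip(clip, rng):
    """Apply the same spatial transform to all frames and sample a time stretch."""
    params = sample_clip_augmentation(rng)
    frames = [shift_frame(frame, params["shift_x"]) for frame in clip]
    positions = stretched_positions(len(clip), params["stretch"])
    return frames, positions, params

rng = random.Random(0)
clip = [[[t, t + 1, t + 2, t + 3]] for t in range(4)]  # four tiny 1x4 "frames"
frames, positions, params = augment_clip(clip, rng)
print(params)
print(frames)
print(positions)
```

Sharing the spatial parameters across the clip keeps the augmentation from introducing frame-to-frame jitter that the discriminator could mistake for motion.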

Figure 3 illustrates the architecture of our low-resolution generator. Our main goal is to make the time axis a first-class citizen, including careful design of a temporal latent representation, temporal style modulation, spatiotemporal convolutions, and temporal upsamples. Through these mechanisms, our generator spans a vast temporal receptive field (5k frames), allowing it to represent temporal correlations at multiple time scales.

We employ a style-based design, similar to Karras et al. , that maps the input temporal noise into a sequence of intermediate latents $\{w_{t}\}$ used to modulate the behavior of each layer in the main synthesis path. Each intermediate latent is associated with a specific frame, but it can significantly influence the scene composition and temporal behavior of several frames through hierarchical 3D convolutions that appear in the main path.

The main synthesis path starts by downsampling the temporal resolution of $\{w_{t}\}$ by 32 $\times$ and concatenating it with a learned constant at $4^{2}$ resolution. It then gradually increases the temporal and spatial resolutions through a series of processing blocks, illustrated in Figure 3 (bottom right), focusing first on the time dimension (ST) and then the spatial dimensions (S). The first four blocks have 512 channels, followed by two blocks with 256, two with 128 and two with 64 channels. The processing blocks consist of the same basic building blocks as StyleGAN2 and StyleGAN3 with the addition of a skip connection; the intermediate activations are normalized before each convolution and modulated according to an appropriately downsampled copy of $\{w_{t}\}$ . In practice, we employ bilinear upsampling and use padding for the time axis to eliminate boundary effects. Through the combination of our temporal latent representation and spatiotemporal processing blocks, our architecture is able to model complex and long-term patterns across time.
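The block widths and the initial temporal length described above amount to simple bookkeeping, sketched below. The constant and function names are hypothetical, and the use of ceiling division while ignoring temporal padding is an assumption of this example.

```python
# Channel widths of the processing blocks, in the order listed in the text.
BLOCK_CHANNELS = [512] * 4 + [256] * 2 + [128] * 2 + [64] * 2

def initial_temporal_length(num_frames, downsample=32):
    """Temporal length of {w_t} after downsampling, ignoring padding."""
    return -(-num_frames // downsample)  # ceiling division

print(len(BLOCK_CHANNELS))           # 10
print(BLOCK_CHANNELS)
print(initial_temporal_length(128))  # 4
```

For a 128-frame training sequence, the synthesis path therefore starts from only a handful of temporal positions and recovers the full frame rate through the temporal upsampling steps.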

For the discriminator, we employ an architecture that prioritizes the time axis via wide temporal receptive field, 3D spatiotemporal and 1D temporal convolutions, and spatial and temporal downsamples; see Appendix C.3 for details.

3.2 Super-resolution network

Figure 2c shows our training setup for the super-resolution network. Our video super-resolution network is a straightforward extension of StyleGAN3 for conditional frame generation. Unlike the low-resolution network that outputs a sequence of frames and includes explicit temporal operations, the super-resolution generator outputs a single frame and only utilizes temporal information at the input, where the real low-resolution frame and $4$ neighboring real low-resolution frames before and after in time are concatenated along the channel dimension to provide context. We remove the spatial Fourier feature inputs and resize and concatenate the stack of low-resolution frames to each layer throughout the generator. The generator architecture is otherwise unchanged from StyleGAN3, including the use of an intermediate latent code that is sampled per video. Low-resolution frames undergo augmentation prior to conditioning as part of the data pipeline, which helps ensure generalization to generated low-resolution images.

The super-res discriminator is a similar straightforward extension of the StyleGAN discriminator, with $4$ low and high-resolution frames concatenated at the input. The only other change is the removal of the minibatch standard deviation layer that we found unnecessary in practice. Both low- and high-resolution segments of 4 frames undergo adaptive augmentation where the same augmentation is applied to all frames at both resolutions. Low-resolution segments also undergo aggressive dropout ( $p=0.9$ probability of zeroing out the entire segment), which prevents the discriminator from relying too heavily on the conditioning signal; see Appendix D.1 for details.
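The conditioning dropout described above can be sketched as follows. The function name and the nested-list frame representation are assumptions for this example; only the probability $p=0.9$ and the idea of zeroing the entire low-resolution segment come from the text.

```python
import random

def drop_conditioning(low_res_segment, p=0.9, rng=random):
    """With probability p, replace the whole low-resolution segment by zeros."""
    if rng.random() < p:
        return [[[0.0 for _ in row] for row in frame]
                for frame in low_res_segment]
    return low_res_segment

rng = random.Random(0)
segment = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]  # four tiny 2x2 frames
trials = [drop_conditioning(segment, rng=rng) for _ in range(1000)]
zeroed = sum(all(v == 0.0 for frame in s for row in frame for v in row)
             for s in trials)
print(zeroed / 1000)  # fraction of zeroed segments, close to 0.9
```

Dropping the whole segment, rather than individual pixels, means the discriminator must most of the time judge the high-resolution frames on their own.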

We find it remarkable that such a simple video super-resolution model appears sufficient for producing reasonably good high-resolution videos. We focus primarily on the low-resolution generator in our experiments, utilizing a single super-resolution network trained per dataset. We feel that replacing this simple network with a more advanced model from the video super-resolution literature is a promising avenue for future work.

Datasets

Most of the existing video datasets introduce little or no new content over time. For example, talking head datasets show the same person for the duration of each video. UCF101 portrays diverse human actions, but the videos are short and contain limited camera motion and little or no new objects that enter the videos over time.

To best evaluate our model, we introduce two new video datasets of first-person mountain biking and horseback riding (Figure 4a,b) that exhibit complex changes over time. Our new datasets include subject motion of the horse or biker, a first-person camera viewpoint that moves through space, and new scenery and objects over time. The videos are available in high definition and were manually trimmed to remove problematic segments, scene cuts, text overlays, obstructed views, etc. The mountain biking dataset has 1202 videos with a median duration of 330 frames at 30 fps, and the horseback dataset has 66 videos with a median duration of 6504 frames also at 30fps. We have permission from the content owners to publicly release the datasets for research purposes. We believe our new datasets will serve as important benchmarks for future work.
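To put the median durations in perspective, the short calculation below converts frame counts to seconds and counts how many distinct 128-frame training windows fit into a video of median length. The helper names are hypothetical; the 128-frame window is the training sequence length used for the low-resolution generator.

```python
def frames_to_seconds(num_frames, fps=30):
    """Duration of a clip in seconds."""
    return num_frames / fps

def num_windows(num_frames, seq_len=128):
    """Number of distinct contiguous seq_len-frame intervals in a video."""
    return max(num_frames - seq_len + 1, 0)

for name, median_frames in [("mountain biking", 330), ("horseback", 6504)]:
    print(name, frames_to_seconds(median_frames), num_windows(median_frames))
```

A median mountain biking video lasts 11 seconds, while a median horseback video lasts roughly 3.6 minutes, so the smaller number of horseback videos is offset by their much longer duration.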

We also evaluate our model on the ACID dataset (Figure 4c) that contains significant camera motion but lacks other types of motion, as well as the commonly used SkyTimelapse dataset (Figure 4d) that exhibits new content over time as the clouds pass by, but the videos are relatively homogeneous and the camera remains fixed.

Results

We evaluate our model through qualitative examination of the generated videos (Section 5.1), analyzing color change over time (Section 5.2), computing the FVD metric (Section 5.3), and ablating the key design choices (Section 5.4). We compare with StyleGAN-V on all datasets. Mountain biking, horseback riding and ACID datasets contain videos with a $16\times 9$ widescreen aspect ratio. We train at $256\times 144$ resolution on these datasets to preserve the aspect ratio. Since StyleGAN-V is based on StyleGAN2, we can easily extend it to support non-square aspect ratios by masking real and generated frames during training. We found it necessary to increase the R1 $\gamma$ hyperparameter by $10\times$ to produce good results with StyleGAN-V on our new datasets that exhibit complex changes over time. We compare with MoCoGAN-HD, TATS and DIGAN using pre-trained models for the SkyTimelapse dataset at $128^{2}$ resolution. For these comparisons, we train a separate super-resolution network to output the frames at $128^{2}$ resolution, but use the same low-resolution generator as in the $256^{2}$ comparison.

The major qualitative difference in results is that our model generates realistic new content over time, whereas StyleGAN-V continually repeats the same content. The effect is best observed by watching videos on the supplemental webpage and is additionally illustrated in Figure 1. Scenery changes over time in real videos and our results as the horse moves forward through space. However, the videos generated by StyleGAN-V tend to morph back to the same scene at regular intervals. Similar repeated content from StyleGAN-V is apparent on all datasets. For example, results on the webpage for the SkyTimelapse dataset show that clouds generated by StyleGAN-V repeatedly move back and forth. MoCoGAN-HD and TATS suffer from unrealistic rapid changes over time that diverge, and DIGAN results contain periodic patterns visible in both space and time. Our model is capable of generating a constant stream of new clouds.

As a further validation of our observations, we conducted a preliminary user study on Amazon Mechanical Turk. We created 50 pairs of videos for each of the 4 datasets. Each pair contained a random video generated by StyleGAN-V and one generated by our method, and we asked the participants which of them exhibited more realistic motion in a forced-choice response. Each pair was shown to 10 participants, resulting in a total of $50\times 4\times 10$ responses. Our method was preferred over 80% of the time for every dataset. Please see Appendix A.1 for details.

5.2 Analyzing color change over time

To gain insight into how well different methods produce new content at appropriate rates, we analyze how the overall color scheme changes as a function of time. We measure color similarity as the intersection between RGB color histograms; this serves as a simple proxy for actual content changes and helps reveal the biases of different models. Let $H(x,i)$ denote a 3D color histogram function that computes the value of histogram bin $i\in[1,\dots,N^{3}]$ for the given image $x$, normalized so that $\sum_{i}H(x,i)=1$. Given video clip $\boldsymbol{x}=\{x_{t}\}$ and frame separation $t$, we define the color similarity as

$$S(\boldsymbol{x},t)=\sum_{i}\min\big(H(x_{0},i),\,H(x_{t},i)\big), \tag{1}$$

where $S(\boldsymbol{x},t)=1$ indicates that the color histograms are identical between $x_{0}$ and $x_{t}$. In practice, we set $N=20$ and report the mean and standard deviation of $S(\cdot,t)$, measured on $1000$ random video clips containing 128 frames each.
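The histogram-intersection measure described in this section can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used for the reported numbers: the function names, the representation of an image as a flat list of 8-bit RGB tuples, and the uniform binning of each channel are assumptions of this example.

```python
from collections import Counter

def color_histogram(image, bins=20):
    """Normalized 3D RGB histogram of an image given as a list of (r, g, b)
    tuples with 8-bit channels. Only non-empty bins are stored."""
    counts = Counter(
        (r * bins // 256, g * bins // 256, b * bins // 256)
        for r, g, b in image
    )
    total = len(image)
    return {bin_id: c / total for bin_id, c in counts.items()}

def color_similarity(frame_0, frame_t, bins=20):
    """Histogram intersection: sum over bins of min(H(x_0, i), H(x_t, i))."""
    h0 = color_histogram(frame_0, bins)
    ht = color_histogram(frame_t, bins)
    return sum(min(v, ht.get(bin_id, 0.0)) for bin_id, v in h0.items())

red = [(255, 0, 0)] * 4
half = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2
blue = [(0, 0, 255)] * 4
print(color_similarity(red, red))   # 1.0
print(color_similarity(red, half))  # 0.5
print(color_similarity(red, blue))  # 0
```

Because the histogram discards pixel positions, the measure responds only to changes in the overall color distribution, which is why it serves as a probe of the rate of change rather than of image quality.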

Figure 5 shows $S(\cdot,t)$ as a function of $t$ for real and generated videos on each dataset. The curves trend downward over time for real videos as content and scenery gradually change. StyleGAN-V and DIGAN are biased toward colors changing too slowly — both of these models include a global latent code that is fixed over the entire video. On the other extreme, MoCoGAN-HD and TATS are biased toward colors changing too quickly. These models use recurrent and autoregressive networks, respectively, both of which suffer from accumulating errors. Our model closely matches the shape of the target curve, indicating that colors in our generated videos change at appropriate rates.

Color change is a crude approximation of the complex changes over time in videos. In Appendix A.3 we also consider LPIPS perceptual distance instead of color similarity and observe the same trends in most cases.

5.3 Fréchet video distance (FVD)

The commonly used Fréchet video distance (FVD) attempts to measure similarity between real and generated video distributions. We find that FVD is sensitive to the realism of individual frames and motion over short segments, but that it does not capture long-term realism. For example, FVD is essentially blind to unrealistic repetition of content over time, which is prominent in StyleGAN-V videos on all of our datasets. We found FVD to be most useful in ablations, i.e., when comparing slightly different variants of the same architecture.

FVD computes the Wasserstein-2 distance between sets of real and generated features extracted from a pre-trained I3D action classification model. Skorokhodov et al. note that FVD is highly sensitive to small implementation differences, down to the level of image compression settings, and that the reported results are not necessarily comparable between papers (see Appendix C of their paper). We report all FVD results using a consistent evaluation protocol, ensuring an apples-to-apples comparison. We separately measure FVD using 128- and 16-frame segments, denoted by $\text{FVD}_{128}$ and $\text{FVD}_{16}$, and sample 2048 random segments from both the dataset and generator in each case.
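For reference, Fréchet-type metrics are commonly evaluated by fitting a Gaussian to each feature set and using the standard closed form of the squared Wasserstein-2 distance between two Gaussians. The symbols below are introduced here for illustration: $\mu_{r},\Sigma_{r}$ and $\mu_{g},\Sigma_{g}$ denote the mean and covariance of the real and generated I3D features, respectively.

$$d^{2}\big(\mathcal{N}(\mu_{r},\Sigma_{r}),\,\mathcal{N}(\mu_{g},\Sigma_{g})\big)=\lVert\mu_{r}-\mu_{g}\rVert_{2}^{2}+\operatorname{Tr}\Big(\Sigma_{r}+\Sigma_{g}-2\big(\Sigma_{r}\Sigma_{g}\big)^{1/2}\Big)$$

The expression depends only on first- and second-order feature statistics of fixed-length segments, which is consistent with the observation that the metric is largely insensitive to behavior beyond the segment length.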

Table 1 (left) reports FVD on all datasets for StyleGAN-V and our model. We outperform StyleGAN-V on horseback riding and mountain biking datasets that contain more complex changes over time, but underperform on ACID and slightly underperform on SkyTimelapse in terms of $\text{FVD}_{128}$ . However, this underperformance strongly disagrees with the conclusions from the qualitative user study in Section 5.1. We believe this discrepancy comes from StyleGAN-V producing better individual frames, and possibly better small-scale motion, but falling seriously short in recreating believable long-term realism – and the FVD being sensitive primarily to the former aspects. Table 1 (right) reports FVD metrics on MoCoGAN-HD, TATS, DIGAN and our model for SkyTimelapse at $128^{2}$ ; we outperform all baselines in terms of $\text{FVD}_{128}$ on this comparison.

5.4 Ablations

Observing long videos during training helps our model learn long-term consistency, which is illustrated in Table 2a that ablates the sequence length used during training of the low-resolution generator. We found that the benefits of training with long videos only became evident after designing a generator architecture with appropriate temporal receptive field to utilize the rich training signal. Note that even though we ablate aspects of the low-resolution generator, we still compute FVD using the final high-resolution videos produced by the super-resolution network.

Footprint of the temporal lowpass filters.

Our temporal latent representation serves a vital role in expanding the receptive field of our generator, modeling patterns over different time scales, and enabling the generation of new content over time. While we primarily leverage long training videos to learn long-term consistencies from data, the size of our temporal lowpass filters plays a role in encouraging the low-resolution mapping network to learn correlations at appropriate time scales. Table 2b demonstrates the negative impact of using inappropriately sized filters. We find that our model performs well with the same filter configuration for all datasets, although it is possible that the ideal settings may vary slightly between datasets.

Effectiveness of the super-resolution network.

Figure 7a,b shows examples of low-resolution frames generated by our model along with the corresponding high-resolution frames produced by our super-resolution network; we find that the super-resolution network generally performs well. To ensure that the quality of our results is not disproportionately limited by the super-resolution network, we further measure FVD when providing the super-resolution network with real low-resolution videos as input in Figure 7c. Indeed, FVD greatly improves in this case, which indicates that there are still significant gains to be realized by further improving the low-resolution generator.

Conclusions

Video generation has historically focused on relatively short clips with little new content over time. We consider longer videos with complex temporal changes, and uncover several open questions and video generation practices worth reassessing — the temporal latent representation and generator architecture, the training sequence length and recipes for using long videos, and the right evaluation metrics for long-term dynamics.

We have shown that representations over many time scales serve as useful building blocks for modeling complex motions and the introduction of new content over time. We feel that the form of the latent space most suitable for video remains an open, almost philosophical question, leaving a large design space to explore. For example, what is the right latent representation to model persistent objects that exit from a video and re-enter later in the video while maintaining a consistent identity?

The benefits we find from training on longer sequences open up further questions. Would video generation benefit from even longer training sequences? Currently we train using segments of adjacent input frames, but it might be beneficial to also use larger frame spacings to cover even longer input sequences, similarly to À-Trous wavelets. Also, what is the best set of augmentations to use when training on long videos to combat overfitting?

Using separate low- and super-resolution networks makes the problem computationally feasible, but it may somewhat compromise the quality of the final high-resolution frames; we believe the “swirly” artifacts visible in some of the results are due to this RGB bottleneck. The integration of more advanced video super-resolution methods would likely be beneficial in this regard, and one could also consider outputting additional features from the low-resolution generator in addition to the RGB color to better disambiguate the super-resolution network’s task.

Quantitative evaluation of the results continues to be challenging. As we observed, FVD goes only a part of the way, being essentially blind to repetitive, even very implausible results. Our tests with how the colors and LPIPS distance change as a function of time partially bridge this gap, but we feel that this area deserves a thorough, targeted investigation of its own. We hope our work encourages further research into video generation that focuses on more complex and longer-term changes over time.

Our work falls within data-driven generative modeling, which, as a field, has well-known potential for misuse with increasing quality improvements. The training of video generators is even more intensive computationally than training still image generators, increasing energy usage. Our project consumed 300 MWh on an in-house cluster of V100 and A100 GPUs.

Acknowledgements

We thank William Peebles, Samuli Laine, Axel Sauer and David Luebke for helpful discussion and feedback; Ivan Skorokhodov for providing additional results and insight into the StyleGAN-V baseline; Tero Kuosmanen for maintaining compute infrastructure; Elisa Wallace Eventing (https://www.youtube.com/c/WallaceEventing) and Brian Kennedy (https://www.youtube.com/c/bkxc) for videos used to make the horseback riding and mountain biking datasets. Tim Brooks is supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2020306087.

Appendix A Additional results

We conducted a user study on Amazon Mechanical Turk to gauge realism of motion generated by our method in comparison to StyleGAN-V, as discussed in Section 5.1 of the main paper. While the user study is on a relatively small scale and does not measure all aspects of video quality, it provides an important signal about realism that is not captured by the Fréchet video distance (FVD) metric. FVD does not favor our method on all datasets, but we observe a substantial qualitative improvement regarding generation of motion and introduction of new content over time. The user study shows preference for videos generated by our method on all datasets, corroborating this observation.

For our user study we create 50 pairs of videos for each of the four datasets, where each pair has one random video from our method and one random video from StyleGAN-V. We instruct participants to select the favorable video in a forced-choice response: “Pick the video that is MORE realistic. For each comparison, you will be presented two videos. Please click each video to view it. Please pick the video that contains more realistic motions.” See Figure 8 for a screenshot of instructions provided to participants and Table 3 for the portion of responses that favor our method compared to StyleGAN-V. Our method was preferred over 80% of the time for every dataset.

Each video pair was shown to 10 participants, resulting in 500 responses per dataset. Each participant gave responses for 5 different video pairs. We select workers who have a past approval rating over 95% and who have completed over 1000 jobs. Our user study uses participants to complete a labeling task to measure video realism; humans are not the subjects and we do not study the participants themselves. IRB review is not applicable. Based on the average completion time, the hourly wage per participant ranged from \$6 to \$9.
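The response counts quoted for the user study follow from simple arithmetic, reproduced below. The variable names are hypothetical, and the number of rating sessions assumes that every session consists of exactly 5 pairs; no preference counts are assumed beyond the stated 80% threshold.

```python
PAIRS_PER_DATASET = 50
NUM_DATASETS = 4
RATERS_PER_PAIR = 10
PAIRS_PER_SESSION = 5

responses_per_dataset = PAIRS_PER_DATASET * RATERS_PER_PAIR
total_responses = responses_per_dataset * NUM_DATASETS
rating_sessions = total_responses // PAIRS_PER_SESSION

# Smallest number of votes per dataset that exceeds an 80% preference rate.
min_votes_over_80 = responses_per_dataset * 80 // 100 + 1

print(responses_per_dataset)  # 500
print(total_responses)        # 2000
print(rating_sessions)        # 400
print(min_votes_over_80)      # 401
```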

A.2 Qualitative results

See Figures 9, 10, 11, and 12 for qualitative results of our videos compared with baseline methods. Please also see the supplemental webpage to watch the same videos, as well as watch grids of randomly sampled videos for each dataset and method. In all videos, StyleGAN-V fails to generate new content as the video progresses, and instead replays the same content repeatedly (e.g., clouds moving back and forth for the SkyTimelapse dataset).

A.3 Analyzing change over time in feature spaces

In Section 5.2 of the main paper, we measure color similarity at increasing frame spacings for different datasets and methods to uncover bias in how much change occurs over time. Intersection of color histograms (Equation 1) is a simple proxy for change over time, and is entirely agnostic to spatial patterns. We include the color similarity plots in Figure 13 of the supplement as well for reference. It is reasonable to also consider other distance functions, such as perceptual similarity metrics. In Figure 14 and Figure 15 we show the LPIPS metric based on AlexNet and VGGNet features, respectively. (Note the opposite direction of change: color similarity decreases over time, whereas feature distance increases over time.)

In most cases, we observe the same trend as for color similarity — StyleGAN-V changes too slowly in horseback, ACID and SkyTimelapse, and our method does a relatively better job at matching the rate of change in real videos. The mountain biking dataset shows a different trend for perceptual similarity, where both our method and StyleGAN-V curves are shifted too high (too much change), and StyleGAN-V is closer to the dataset curve. One caveat of this use of perceptual metrics is that, even ignoring the temporal aspect, we observe substantial distributional shift of pretrained features between generated and real frames (e.g., penultimate VGG features for both our model and StyleGAN-V have over 30% larger magnitudes than for real frames on the biking dataset). It is thus unclear to what extent the difference in curves between real and generated videos is due to different rates of change over time or the distributional shift of features independent of change over time.

We favor the color similarity measure as the simplest approximation for how quickly things change over time, and acknowledge that it is not intended as a standalone metric but a probe into the biases of videos generated with different methods.

A.4 Image quality tradeoff

In practice, there exists a tradeoff between per-frame image quality and the quality of motion and change over time. At one extreme, an image generator is optimized specifically for image quality. Image generators produce very high quality images, but have no inherent ability to produce realistic videos. Many video generation models prioritize frame quality, whereas our model prioritizes accurate changes over long durations. $\text{FVD}_{128}$ and $\text{FVD}_{16}$ metrics measure unknown combinations of spatial and temporal patterns, and while they provide a useful signal, it is not clear where these metrics fall in terms of favoring per-frame image quality or accurate temporal changes.

We analyze color similarity over time in Section 5.2 of the main paper. Color similarity between frames is agnostic to spatial patterns, and provides insight on the rate of change over time in isolation from per-frame image quality. To gain a holistic picture of the priorities of our model, we also compute a per-frame image quality metric, video-balanced Fréchet inception distance ($\text{FID}_{\text{V}}$), which we describe below and report in Table 4. StyleGAN-V outperforms our model on three of the four datasets in terms of $\text{FID}_{\text{V}}$. This tradeoff is expected, since StyleGAN-V is heavily based on the StyleGAN2 image generator. It produces high image quality but is unable to model complex motions or changes over time, whereas our model prioritizes the time axis.

Assessing the quality of generated videos is multifaceted, and we believe that all of the evaluations we provide (qualitative results, user study, color change over time, FVD, and FID) help expose gaps in the abilities of existing methods and the strengths and weaknesses of our new model.

To correctly measure per-frame image quality, it is important to balance the computation of FID such that very long videos in the dataset do not overpower results. (This is particularly important for the SkyTimelapse dataset, which has an outlier video that is extremely long.) Skorokhodov et al. point out that it is undesirable for these very long videos to bias training or computing FVD, and the same is true for computing FID per-frame on video data.

To balance FID so that each training video is valued equally, we weight each frame's contribution to the mean and covariance by the inverse of the number of frames in its clip when measuring the Wasserstein-2 distance between sets of features. This has the effect of valuing each video equally, while still including contributions from all frames, which is important when there are a small number of long videos such as in our horseback riding dataset. A similar strategy to weight covariance and mean when computing FID is used by Kynkäänniemi et al. to analyze the effect of balancing object class occurrences. When computing statistics for generated frames, we sample $50\,000$ videos of length 1 frame (at $t=0$ for StyleGAN-V).
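The weighting described above can be sketched as follows. This is a minimal illustration rather than the evaluation code: the function name and the nested-list feature format are hypothetical, and the covariance uses the plain weighted (biased) estimator, which is an assumption since the text does not specify a normalization.

```python
def video_balanced_stats(videos):
    """Weighted mean and covariance of per-frame features.

    videos: list of videos, each a list of per-frame feature vectors (lists of floats).
    Every video receives total weight 1 / len(videos), split evenly across its frames.
    """
    num_videos = len(videos)
    dim = len(videos[0][0])
    weighted = []
    for frames in videos:
        w = 1.0 / (num_videos * len(frames))  # inverse of the clip's frame count
        weighted.extend((w, f) for f in frames)

    mean = [sum(w * f[d] for w, f in weighted) for d in range(dim)]
    cov = [[sum(w * (f[i] - mean[i]) * (f[j] - mean[j]) for w, f in weighted)
            for j in range(dim)] for i in range(dim)]
    return mean, cov
```

With one single-frame video and one 100-frame video, both contribute half of the total weight, whereas an unweighted average would be dominated by the longer clip. The resulting statistics would then be plugged into the usual Fréchet distance formula in place of the unweighted ones.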

Appendix B Dataset details

We evaluate our model using two existing datasets, Aerial Coastline Imagery Dataset (ACID) and SkyTimelapse, and two new datasets: horseback riding and mountain biking. We center crop videos to the desired aspect ratio if needed ($16\times 9$ for all datasets except SkyTimelapse, for which we use a square crop to match prior work), and then resize to the target resolution using the PIL library’s Lanczos resampling method. For the ACID dataset we combine both train and test splits to maximize the amount of training data. For the SkyTimelapse dataset we use only the train split to ensure our model is comparable with prior work.
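A minimal sketch of the crop-and-resize step, assuming frames are loaded as Pillow images. The helper names and the rounding of the crop box are choices made for this example, and the target resolution is left as a parameter because it is not stated in this section.

```python
def center_crop_box(width, height, aspect_w, aspect_h):
    """Return (left, top, right, bottom) of the largest centered crop with the given aspect ratio."""
    if width * aspect_h > height * aspect_w:   # too wide: trim left and right
        crop_w, crop_h = height * aspect_w // aspect_h, height
    else:                                      # too tall (or exact): trim top and bottom
        crop_w, crop_h = width, width * aspect_h // aspect_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

def preprocess_frame(image, aspect, target_size):
    """Center crop a PIL image to aspect = (w, h), then Lanczos-resize to target_size = (w, h)."""
    from PIL import Image  # imported lazily so the crop helper works without Pillow
    box = center_crop_box(image.width, image.height, *aspect)
    return image.crop(box).resize(target_size, resample=Image.LANCZOS)
```

For a 1920 by 1080 source frame, a 16:9 crop leaves the frame unchanged, while a square crop keeps the central 1080 by 1080 region.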

Figure 16 shows histograms of the durations and counts of training videos for all four datasets. Our new datasets both feature longer median clip lengths than the existing datasets. When training our model, we filter the ACID and SkyTimelapse datasets for clips with at least 128 frames. We allow the StyleGAN-V baseline to train on all clips with at least 3 frames (the number needed by their method). Both existing datasets can be obtained from their respective project webpages (ACID: https://infinite-nature.github.io/; SkyTimelapse: https://sites.google.com/site/whluoimperial/mdgan). The copyright status of both existing datasets is ambiguous, as neither specifies a license or details about content ownership. We obtained explicit licenses for our two new datasets, described below.

B.1 Horseback riding

We introduce a new dataset of first-person horseback riding that we will release to the public for research purposes. The videos were created by Wallace Eventing and examples of the videos can be found on their YouTube channel: https://www.youtube.com/c/WallaceEventing. We reached out directly and received permission to create a dataset from their videos, to use it in our research, and to release it for non-commercial research purposes. We will release the filtered and processed video frames directly, which avoids inconsistent versions of the dataset when videos become unavailable or are processed differently. The dataset will be released under a custom license agreed upon with Wallace Eventing that permits use for non-commercial research purposes but does not allow redistribution of the dataset.

The videos contain first-person helmet camera footage of horseback riding events, with little or no personally identifying information visible. They are high quality (1080p) at 60fps, although we subsample frames to attain 30fps. Statistics of our dataset filtering are presented in Table 5. The dataset was sourced from 194 original videos, which we then filtered down to 44 videos with stabilized motion and a consistent camera perspective. We manually extracted 66 clips from the selected videos, cutting out scene changes, text overlays, videos with obstructed views, and the beginnings and ends of videos.

B.2 Mountain biking

We also introduce a new dataset of first-person mountain biking that we will release to the public. The videos were created by Brian Kennedy (BKXC) and examples of the videos can be found on their YouTube channel: https://www.youtube.com/c/bkxc. We reached out directly and received permission to create a dataset from their videos to use in our research and release as a dataset under a CC BY 4.0 license.

The videos contain first-person mountain biking. There is little personally identifying information visible, although there are occasional other bikers who pass by and whose faces can be seen. The videos are high quality (2160p) at 30fps. This dataset underwent much more extensive filtering and extraction of training clips since the source videos contain many cuts and abrupt changes. See Table 5 for statistics of our dataset curation. From 48 source videos we selected 28 videos with ample footage of stable mountain biking, and then manually filtered for contiguous segments of mountain biking that were at least 5 seconds long, resulting in 1202 total clips.

Appendix C Low-resolution implementation details

We find that overfitting of the discriminator network is particularly severe when training with long sequences. To alleviate the overfitting, we apply DiffAug to real and generated videos prior to the discriminator. We use all categories of DiffAug augmentations — color, cutout, and translation — with default strengths for color and cutout augmentations, and maximum x- and y-translations of 32 pixels for the square SkyTimelapse dataset and 16 pixels for the non-square biking, horseback and ACID datasets. We also tried using the ADA adaptive augmentation strategy, but it caused leakage of augmentations into the generated videos, even when augmentations were applied with low probability.

In addition to DiffAug, we employ fractional time stretching augmentation, where we resize the temporal axis by a factor of $s=2^{a}$ for $a\sim\mathcal{U}(-1,1)$ with linear interpolation and zero padding. If time stretching augmentation upsamples the time axis, the video is randomly cropped to fit within the original 128-frame window. Similarly, if time stretching augmentation downsamples the time axis, the video is zero padded with random amounts before and after to fit within the original 128-frame window. Fractional time stretching augmentation is related to the subsampling augmentation commonly used by other methods, but supports a greater variety of augmentations since temporal scaling amounts are fractional. Identifying the best augmentation policies for video generation models is an important direction for future work.
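The procedure can be sketched on a toy video stored as a list of per-frame feature vectors. This is an illustrative sketch only: the endpoint-aligned interpolation grid, the rounding of the stretched length, and the function names are assumptions not specified in the text.

```python
import random

def resize_time(frames, new_len):
    """Linearly interpolate a list of frame vectors to new_len frames (endpoints aligned)."""
    old_len = len(frames)
    if new_len == 1 or old_len == 1:
        return [list(frames[0]) for _ in range(new_len)]
    out = []
    for j in range(new_len):
        pos = j * (old_len - 1) / (new_len - 1)
        lo = min(int(pos), old_len - 2)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(frames[lo], frames[lo + 1])])
    return out

def time_stretch(frames, rng=random):
    """Stretch time by s = 2**a, a ~ U(-1, 1), then crop or zero pad back to the input length."""
    window = len(frames)
    s = 2.0 ** rng.uniform(-1.0, 1.0)
    stretched = resize_time(frames, max(1, round(window * s)))
    if len(stretched) >= window:                 # upsampled: random crop
        start = rng.randint(0, len(stretched) - window)
        return stretched[start:start + window]
    pad_total = window - len(stretched)          # downsampled: random zero padding
    before = rng.randint(0, pad_total)
    zero = [0.0] * len(frames[0])
    return [zero] * before + stretched + [zero] * (pad_total - before)
```

Because $s$ is drawn from a continuous range between $0.5$ and $2$, the output covers playback speeds that integer frame subsampling cannot produce, while always returning a clip of the original length.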

C.2 Temporal lowpass filters

To capture long-term temporal correlations in the intermediate latent codes, we enrich each of 8 channels of input temporal noise with a set of $N=128$ lowpass filters $\{f_{i}\}$, as described in Section 3.1 of the main paper. Specifically, we use Kaiser lowpass filters , following the implementation of . We space lowpass filter sizes exponentially, where each filter has temporal footprint $k_{i}=k_{\text{min}}\big(\frac{k_{\text{max}}}{k_{\text{min}}}\big)^{\frac{i}{N-1}}$, where $0\leq i<N$, $k_{\text{min}}=500$, and $k_{\text{max}}=10000$.
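The exponential spacing of filter footprints can be checked numerically with the short sketch below. It only evaluates the formula for $k_{i}$ and does not construct the Kaiser filters themselves; the function name is hypothetical.

```python
def lowpass_footprints(n=128, k_min=500.0, k_max=10000.0):
    """Exponentially spaced temporal footprints k_i for i = 0..n-1."""
    ratio = k_max / k_min
    return [k_min * ratio ** (i / (n - 1)) for i in range(n)]
```

The first footprint equals $k_{\text{min}}$, the last equals $k_{\text{max}}$, and consecutive footprints differ by the constant factor $(k_{\text{max}}/k_{\text{min}})^{1/(N-1)}$, roughly $1.024$ for the values above.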

C.3 Discriminator architecture

Our low-resolution discriminator architecture is heavily inspired by the StyleGAN discriminators, with the addition of spatiotemporal and temporal processing in order to model realistic motions and changes over time. See Figure 17 for a depiction of the discriminator architecture.

The video is first expanded from 3 RGB channels to 128 channels using a $1\times 1$ convolutional layer. The first block only operates spatially, downsampling height and width by $2\times$ and using $3\times 3$ spatial convolutions. The remaining 3 blocks downsample both spatially and temporally and use $5\times 3\times 3$ spatiotemporal convolutions. We omit temporal processing from the first block to save compute, since running 3D convolutions at the full resolution is substantially more expensive. We otherwise find the inclusion of temporal processing crucial for the model to learn temporal dynamics. In each block, the number of channels is doubled until reaching 512.

To further prioritize learning accurate motions and changes over time, we include four 1D temporal convolutions, each with a kernel size of 5 and followed by a LeakyReLU nonlinearity. Finally, following the StyleGAN discriminator, features are flattened and passed through 2 linear layers with a LeakyReLU nonlinearity in between to produce the final logits.
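To make the downsampling schedule concrete, the sketch below traces tensor shapes through the input layer and the four blocks. It is a shape calculation only, not a network definition. The temporal downsampling factor of 2 and the example input size are assumptions, since the text states that the later blocks downsample temporally but not by how much. The 1D temporal convolutions and linear layers are not traced because their padding and widths are not given here.

```python
def discriminator_shapes(frames=128, height=64, width=64, temporal_factor=2):
    """Trace (channels, frames, height, width) through the 1x1 conv and the four blocks."""
    shapes = [("input", (3, frames, height, width))]
    channels = 128
    shapes.append(("1x1 conv", (channels, frames, height, width)))
    for block in range(4):
        channels = min(channels * 2, 512)        # double until reaching 512
        height, width = height // 2, width // 2  # every block downsamples spatially by 2
        if block > 0:                            # the first block is spatial only
            frames //= temporal_factor           # assumed temporal downsampling factor
        shapes.append((f"block {block + 1}", (channels, frames, height, width)))
    return shapes
```

Under these assumptions, a 128-frame input at $64\times 64$ resolution reaches the temporal convolutions with 512 channels, 16 time steps, and a $4\times 4$ spatial grid.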

C.4 Training

We use a batch size of 64 videos, each of length 128 frames. We trained models with a variety of single- and multi-node jobs. We train each run for a maximum of $100\,000$ steps and cut training runs short if FVD begins increasing. Training the low-res generator takes $1.7$ days for the maximum $100\,000$ steps using $4\times$ nodes each containing $8\times$ NVIDIA A100 GPUs. The low-res generator has 83.2M parameters and the low-res discriminator has 46.4M parameters. We use R1 regularization with $\gamma=1$ for non-square datasets, and $\gamma=4$ for the square SkyTimelapse dataset. We train with the Adam optimizer with generator learning rate of $0.003$ , discriminator learning rate of $0.002$ , and $\beta_{1}=0$ and $\beta_{2}=0.99$ for both generator and discriminator. (Note: Adam with $\beta_{1}=0$ is equivalent to RMSprop with the bias correction term from Adam.) We use an exponential moving average of the generator weights, with $\beta_{\text{ema}}=0.99985$ . We select the checkpoint with best $\text{FVD}_{128}$ .
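The note on Adam with $\beta_{1}=0$ can be illustrated with a scalar update. This is a minimal sketch under stated assumptions: the function name is hypothetical, the default learning rate is the discriminator value quoted above, and `eps=1e-8` is an assumed value that the text does not specify.

```python
import math

def adam_beta1_zero_step(param, grad, second_moment, step, lr=0.002, beta2=0.99, eps=1e-8):
    """One scalar Adam update with beta1 = 0; returns (new_param, new_second_moment).

    With beta1 = 0 the first-moment estimate is just the current gradient, so the
    update is RMSprop-style scaling plus Adam's bias correction of the second moment.
    """
    second_moment = beta2 * second_moment + (1.0 - beta2) * grad * grad
    corrected = second_moment / (1.0 - beta2 ** step)  # bias correction, step starts at 1
    return param - lr * grad / (math.sqrt(corrected) + eps), second_moment
```

On the first step the bias-corrected second moment equals the squared gradient, so the parameter moves by approximately the learning rate regardless of the gradient magnitude.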

Appendix D Super-resolution implementation details

The super-resolution network undergoes augmentation of two forms: (1) augmentation of real and generated videos applied prior to the discriminator to prevent overfitting, and (2) augmentation of conditional real low resolution videos during training to improve generalization to generated low resolution videos at inference time.

Augmentation to prevent discriminator overfitting uses ADA with default settings, and applies the same augmentations to all frames from both high and low resolution videos. To additionally prevent overfitting and prevent the discriminator from focusing too much attention on the conditioning signal, we employ strong dropout augmentation with probability $p=0.9$ of zeroing out the entire conditional low resolution video. This augmentation occurs before the discriminator only, and does not affect the inputs to the super-resolution network.
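The conditioning dropout can be sketched in a few lines. The function name and the nested-list video format are hypothetical; the sketch only captures the rule that the entire conditional video is zeroed with probability $p=0.9$ on the discriminator side.

```python
import random

def drop_conditioning(low_res_video, p_drop=0.9, rng=random):
    """With probability p_drop, replace the conditioning video with zeros of the same shape.

    low_res_video: nested list [frame][pixel] of floats. Intended for the discriminator
    input only; the super-resolution network still receives the original video.
    """
    if rng.random() < p_drop:
        return [[0.0] * len(frame) for frame in low_res_video]
    return low_res_video
```

Since the conditioning is present in only about one of ten discriminator inputs, the discriminator must judge most samples from the high resolution frames alone.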

Low-resolution conditioning augmentation to improve generalization

We train our super-resolution network with real low resolution videos as conditioning, but use generated low resolution videos at inference time. There exists a domain gap between the real and generated low resolution videos, and to ensure our super-resolution network is robust to the domain gap, we augment real low resolution videos during training. Similar strategies are used in image generators with super-resolution refinement, where corruption is added to real low resolution inputs during training. We use a modified version of the ADA augmentation pipeline, only enabling additive Gaussian noise, isotropic and non-isotropic scaling, rotation, and fractional translation. Each augmentation is applied to the entire low resolution video with a fixed probability of $50\%$, and with much smaller strengths than the default pipeline (noise_std=0.08, scale_std=0.08, aniso_std=0.08, rotate_max=0.016, xfrac_std=0.016). This augmentation is applied in the dataset pipeline and affects conditional inputs to the discriminator and super-resolution network only during training.
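The sketch below draws one set of augmentation parameters per video using the probabilities and strengths listed above. The sampling distributions (half-normal noise level, base-2 log-normal scales, uniform rotation as a fraction of $\pi$, normal translation as a fraction of image size) are assumptions loosely modeled on the public ADA pipeline and are not confirmed by the text; the function and key names are hypothetical, and applying the transforms to pixels is omitted.

```python
import math
import random

# Strengths quoted in the text; the sampling distributions below are assumptions.
STRENGTHS = {"noise_std": 0.08, "scale_std": 0.08, "aniso_std": 0.08,
             "rotate_max": 0.016, "xfrac_std": 0.016}

def sample_conditioning_augmentation(rng=random, p=0.5, strengths=STRENGTHS):
    """Draw one parameter set, shared by every frame of a low resolution video."""
    params = {"noise_sigma": 0.0, "scale": 1.0, "aniso": 1.0,
              "rotation": 0.0, "translate": (0.0, 0.0)}
    if rng.random() < p:  # additive Gaussian noise level
        params["noise_sigma"] = abs(rng.gauss(0.0, 1.0)) * strengths["noise_std"]
    if rng.random() < p:  # isotropic scaling, log-normal in base 2
        params["scale"] = 2.0 ** rng.gauss(0.0, strengths["scale_std"])
    if rng.random() < p:  # non-isotropic scaling
        params["aniso"] = 2.0 ** rng.gauss(0.0, strengths["aniso_std"])
    if rng.random() < p:  # rotation in radians
        params["rotation"] = rng.uniform(-1.0, 1.0) * math.pi * strengths["rotate_max"]
    if rng.random() < p:  # fractional translation, in units of image size
        s = strengths["xfrac_std"]
        params["translate"] = (rng.gauss(0.0, s), rng.gauss(0.0, s))
    return params
```

Sampling once per video, rather than once per frame, keeps the corruption temporally consistent across the conditioning frames.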

D.2 Prefiltering of low-res conditioning

The low resolution frame being upsampled is concatenated with the $4$ frames before and the $4$ frames after it in the low resolution video sequence, creating a stack of $9$ low resolution frames. The stack is then resized and concatenated with features at each layer of the StyleGAN3 generator. We experimented with different prefiltering strengths when resizing the $9$ conditioning frames, and found that strong prefiltering helps remove aliasing in the final video. This is related to the anti-aliasing properties of the StyleGAN3 generator, which includes strong filtering of intermediate features. Importantly, we do not prefilter the conditional frames when the input is the same resolution as the features (i.e., $64\times 64$), since we found that doing so negatively impacts the results. We only apply prefiltering when resizing, and we use the same prefiltering kernels as early layers of StyleGAN3.
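Selecting the 9-frame conditioning stack can be sketched as an index computation. The function name is hypothetical, and clamping out-of-range neighbors to the first or last frame is an assumption; the text does not say how frames near the ends of a sequence are handled.

```python
def conditioning_indices(t, num_frames, radius=4):
    """Indices of the 2 * radius + 1 low resolution frames stacked around frame t.

    Out-of-range neighbors are clamped to the first or last frame (an assumption).
    """
    return [min(max(t + offset, 0), num_frames - 1) for offset in range(-radius, radius + 1)]
```

For an interior frame such as `t = 10`, this returns frames 6 through 14, with the frame being upsampled at the center of the stack.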

D.3 Training

We use a batch size of 32 videos. The discriminator network takes real and generated videos of length 4 frames as input, and for each generated frame the super-res network is provided 9 input frames (4 neighboring frames on either side of the primary frame) to provide temporal context. The network architectures share details with StyleGAN3, except for the differences mentioned in Section 3.2 of the main paper. We train for a maximum of $275\,000$ steps, which takes 6.8 days using one node of $8\times$ 16GB NVIDIA V100 GPUs. The super-res network has 27.2M parameters, and the discriminator network has 24.0M parameters. We use R1 regularization with $\gamma=1$ for all datasets. We train with the Adam optimizer with generator and discriminator learning rate of $0.003$, $\beta_{1}=0$ and $\beta_{2}=0.99$. We use an exponential moving average of the generator weights with $\beta_{\text{ema}}=0.99985$. We select the checkpoint with best $\text{FVD}_{16}$ when evaluated using real low resolution conditioning, and use the same super-resolution network for many low-resolution experiments.
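To give a sense of scale for $\beta_{\text{ema}}=0.99985$, the sketch below shows a per-step moving average update and the corresponding half-life. The function names and the flat-list weight format are hypothetical, and the sketch assumes the average is updated once per training step.

```python
import math

def ema_update(ema_weights, weights, beta=0.99985):
    """Return the updated exponential moving average of a flat list of weights."""
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema_weights, weights)]

def ema_half_life(beta=0.99985):
    """Number of updates after which a weight snapshot's contribution has halved."""
    return math.log(0.5) / math.log(beta)
```

Under that assumption the half-life is roughly 4,600 updates, so the averaged generator reflects a window of several thousand recent training steps.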