StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

Ivan Skorokhodov, Sergey Tulyakov, Mohamed Elhoseiny

Introduction

Recent advances in deep learning have pushed image generation to unprecedented photo-realistic quality and spawned many industry applications. Video generation, however, has not enjoyed similar success and struggles to fit complex real-world datasets. The difficulties are caused not only by the more complex nature of the underlying data distribution, but also by the computationally intensive video representations employed by modern generators. They treat videos as discrete sequences of images, which is very demanding for representing long high-resolution videos and induces the use of expensive conv3d-based architectures to model them. E.g., DVD-GAN requires ${\approx}\$30$k to train on $256^{2}$ resolution.

Developing such a framework comes with three challenges. First, sine/cosine positional embeddings are periodic by design and depend only on the input coordinates. This does not suit video generation, where temporal information should be aperiodic (otherwise, videos will be cycled) and different for different samples. Next, since videos are perceived as infinite continuous signals, one needs to develop an appropriate sampling scheme to use them in a practical framework. Finally, one needs to accordingly redesign the discriminator to work with the new sampling scheme.

To solve the first issue, we develop positional embeddings with time-varying wave parameters which depend on motion information, sampled uniquely for different videos. This motion information is represented as a sequence of motion codes produced by a padding-less conv1d-based model. We prefer it over the usual LSTM network to alleviate the RNN’s instability when unrolled to large depths and to produce frames non-autoregressively.

Next, we investigate the question of how many samples are needed to learn a meaningful video generator. We argue that it can be learned from extremely sparse videos (as few as 2 frames per clip), and justify it with a simple theoretical exposition (§3.3) and practical experiments (see Table 2).

Finally, since our model sees only 2-4 randomly sampled frames per video, it is highly redundant to use expensive conv3d-blocks in the discriminator, which are designed to operate on long sequences of equidistant frames. That’s why we replace them with a conv2d-based model, which aggregates information temporally via simple concatenation and is conditioned on the time distances between its input frames. Such a redesign improves training efficiency (see Table 1), provides a more informative gradient signal to the generator (see Fig 4) and simplifies the overall pipeline (see §3.2), since we no longer need two different discriminators to operate on image and video levels separately, as modern video synthesis models do.

We build our model, named StyleGAN-V, on top of the image-based StyleGAN2. It is able to produce arbitrarily long videos at an arbitrarily high frame rate in a non-autoregressive manner and enjoys great training efficiency: it is only ${\approx}5\%$ costlier than the classical image-based StyleGAN2 model, while having only ${\approx}10\%$ worse plain image quality in terms of FID (see Fig 3). This allows us to easily scale it to HQ datasets, and we demonstrate that it is directly trainable at $1024^{2}$ resolution.

For empirical evaluation, we use 5 benchmarks: FaceForensics $256^{2}$, SkyTimelapse $256^{2}$, UCF101 $256^{2}$, RainbowJelly $256^{2}$ (introduced in our work) and MEAD $1024^{2}$. Apart from our model, we train 5 different methods from scratch and measure their performance using the same evaluation protocol. Frechet Video Distance (FVD) serves as the main metric for video synthesis, but there is no complete official implementation of it (see §4 and Appx C). This leads to discrepancies in the evaluation procedures used by different works because FVD, similarly to FID, is very sensitive to data format and sampling strategy. That’s why we implement, document and release our complete FVD evaluation protocol. In terms of sheer metrics, our method performs on average ${\approx}30\%$ better than the closest runner-up.

Related work

Video synthesis. Early works on video synthesis mainly focused on video prediction, i.e. generating future frames given a sequence of the previously seen ones. Early approaches for this problem typically employed recurrent convolutional models trained with a reconstruction objective, but later adversarial losses were introduced to improve the synthesis quality. Some recent works explore autoregressive video prediction with recurrent or attention-based models. Another close line of research is video interpolation, i.e. increasing the frame rate of a given video. In our work, we study video generation, which is a more challenging problem than video prediction since it seeks to synthesize videos from scratch, i.e. without using the expressive conditioning on previous frames. Classical methods in this direction are typically based on GANs. MoCoGAN and TGAN decompose the generator’s input noise into a content code and motion codes, which became a standard strategy for many subsequent works. Several approaches consider video generation from a single clip.

Some recent works also consider high-resolution video synthesis, but only with training in the latent space of a pretrained image generator. StyleGAN-V is trained on extremely sparse videos. This makes it related to prior works that use a pyramid of discriminators operating on different temporal resolutions (with a subsampling factor of up to $\times 8$). Our model builds on time continuity, which has also been explored in the context of video synthesis.

To the best of our knowledge, all modern video synthesis approaches utilize expensive conv3d blocks in their decoder and/or encoder components. Often, GAN-based approaches utilize two discriminators, operating on image and video levels independently, where the video discriminator operates at a low resolution to save computation. In our work, we aggregate the temporal information via a simple concatenation of feature vectors extracted from the frames and this strategy suffices to build a state-of-the-art video generator.

Neural Representations. Neural representations are a recent paradigm that uses neural networks to represent continuous signals, such as images, videos, audio, 3D objects and scenes. They are mostly popular for 3D reconstruction and geometry processing tasks, including video-based reconstruction. Several recent projects explored the task of building generative models over such representations to synthesize images, 3D objects or multi-modal signals, and our work extends this line of research to video generation.

Concurrent works. The development of neural representations-based approaches moves extremely fast and there are two concurrent works which propose ideas similar to ours. DIGAN is a concurrent project that explores the same direction of using neural-based representations for continuous video synthesis and shares a lot of ideas with our work. The authors also consider a continuous-time generator, trained by a discriminator without conv3d layers. The core difference with our work is that they use a different parametrization of motions and use a dual discriminator $\mathsf{D}$: one operates on $(\bm{x}_{1},\bm{x}_{2},\Delta t)$ and the second one on individual images. We enumerate the differences and similarities in Appx H. NeRV uses convolutional neural representations of videos for compression and denoising tasks. GEM utilizes generative latent optimization to build a multi-modal generative model.

Model

Overview. The generator consists of three components: a content mapping network $\mathsf{F}_{\text{c}}$, a motion mapping network $\mathsf{F}_{\text{m}}$ and a synthesis network $\mathsf{S}$. $\mathsf{F}_{\text{c}}$ and $\mathsf{S}$ are borrowed from StyleGAN2, and we only modify $\mathsf{S}$ by tiling and concatenating motion codes $\bm{v}_{t}$ to its constant input tensor.

Acyclic positional encoding. Traditional positional embeddings are cyclic by default. This does not create problems in traditional applications (like image or scene representations) because the spatial domain used there never exceeds the period length. But for video generation, cyclicity is not desirable, because it makes a video loop at some point. To solve this issue, we develop an acyclic positional encoding.

In practice, we found it useful to compute periods as:

where $\mathds{1}$ is a vector of ones and $\bm{\sigma}$ are linearly-spaced scaling coefficients. See Appx B and the source code for details.

3.2 Discriminator structure

Such a design is considerably more efficient than using both image and video discriminators and provides a more informative learning signal to the generator (see Fig 4).
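The aggregation idea can be illustrated with a toy sketch. It only shows the concatenation of per-frame feature vectors together with the time gaps between the frames. Appending the gaps to the feature vector is an assumption made for illustration (the actual conditioning mechanism is defined in the source code), and `aggregate_frame_features` is a hypothetical helper name.

```python
def aggregate_frame_features(frame_features, frame_times):
    """Concatenate per-frame feature vectors and append the time gaps.

    frame_features: list of k equally sized feature vectors (lists of floats),
                    one per frame, e.g. produced by a shared 2D backbone.
    frame_times:    list of k frame positions t_1 < ... < t_k.
    """
    if len(frame_features) != len(frame_times):
        raise ValueError("need exactly one time position per frame")
    # Temporal aggregation by plain concatenation (no conv3d, no recurrence).
    joint = [value for feats in frame_features for value in feats]
    # Time distances between consecutive frames (illustrative conditioning input).
    gaps = [float(b - a) for a, b in zip(frame_times, frame_times[1:])]
    return joint + gaps


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # k = 3 frames, 2 features each
joint_input = aggregate_frame_features(features, [0, 4, 20])
```

The resulting vector grows linearly with the number of frames, which is cheap when only a few frames per clip are used.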

3.3 Implicit assumptions of sparse training

Consider the problem of learning a probability distribution $p(\bm{x})=p(x_{1},...,x_{n})$ and suppose that we utilize sparse training, i.e. select $k$ coordinates of the vector $\bm{x}$ randomly on each iteration of the optimization process. Then the optimization objective is equivalent to learning all possible marginal distributions $p(x_{i_{1}},...,x_{i_{k}})$ instead of learning the joint $p(\bm{x})$. When does learning the marginals allow one to obtain the full joint distribution at the end? The following simple statement adds some clarity to this question.

Let’s denote by $\mathcal{J}_{<i}^{k}$ a collection of sets $J_{i}$ of up to $k$ indices $j$ s.t. $\forall J_{i}\in\mathcal{J}_{<i}^{k}$ we have $j<i$ for all $j\in J_{i}$. In other words, $J_{i}$ is a set of up to $k$ indices $j\in[1,i)$. Then, $p(\bm{x})$ can be represented as a product of $n$ marginals $p(x_{i},\bm{x}_{J_{i}})$ for $i\in[1,n]$ if and only if $\forall i$ there exists $J_{i}\in\mathcal{J}_{<i}^{k-1}$ s.t. $p(x_{i}|\bm{x}_{<i})\equiv p(x_{i}|\bm{x}_{J_{i}})$.

The above statement is primitive (see the proof in Appx F) but can provide useful practical intuition. For video synthesis, it implies that one can learn a video generator by using only $k$ frames per video only if for any frame $\bm{x}_{i}$, there exist at most $k-1$ previous frames sufficient to properly predict it (see Appx F). We argue that very few frames suffice to make such a prediction for the modern video synthesis benchmarks. For example, in SkyTimelapse, the motions are typically unidirectional and thus easily predictable from only 2 previous frames, which corresponds to training with $k=3$ frames per video.
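As an illustrative special case (a worked example added for intuition, not taken from Appx F), suppose the sequence is first-order Markov, so that each coordinate depends only on its immediate predecessor, i.e. $J_{i}=\{i-1\}$. Then the joint factorizes into marginals over at most $k=2$ coordinates:

```latex
p(\bm{x})
  = p(x_{1}) \prod_{i=2}^{n} p(x_{i} \mid x_{i-1})
  = p(x_{1}) \prod_{i=2}^{n} \frac{p(x_{i}, x_{i-1})}{p(x_{i-1})}
```

Every factor on the right-hand side is a marginal over one or two coordinates, so under this assumption pairs of frames carry all the information needed to recover the joint distribution.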

We treat videos as infinite continuous signals, but in practice one has to set a limit on the maximum time location $T$ which can be seen during training. To the best of our knowledge, previous methods use at most $T=64$, but in our case we easily train the model with $T=1024$ since our generator is non-autoregressive and our discriminator uses only the relative temporal information. We set the maximum distance between $t_{1}$ and $t_{k}$ to 32 to cover short and medium-term movements: otherwise, we observed unstable training and abrupt motions. To sample frames, we first sample the distance $(t_{k}-t_{1})\sim U[k-1,32]$ between them, and then sample the offset $t_{1}\sim U[0,T-t_{k}]$. After that, frame locations $t_{i}$ for $i\in\{2,...,k-1\}$ are selected at random without repetitions.
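The sampling procedure above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the released implementation: frame positions are taken to be integers, the offset is drawn so that the last frame does not exceed `max_time`, and `sample_frame_times` is a hypothetical helper name.

```python
import random


def sample_frame_times(k, max_time=1024, max_dist=32, rng=random):
    """Sample k sorted, distinct integer frame positions for one training clip."""
    if not 2 <= k <= max_dist + 1:
        raise ValueError("k must satisfy 2 <= k <= max_dist + 1")
    # Distance between the first and the last frame: (t_k - t_1) ~ U[k - 1, max_dist].
    dist = rng.randint(k - 1, max_dist)
    # Offset of the first frame. Assumption: it is drawn from [0, max_time - dist],
    # so that the last frame stays within [0, max_time].
    t_first = rng.randint(0, max_time - dist)
    t_last = t_first + dist
    # Intermediate positions are drawn without repetition strictly between the ends.
    middle = rng.sample(range(t_first + 1, t_last), k - 2)
    return sorted([t_first, *middle, t_last])


times = sample_frame_times(k=3)  # three increasing positions at most 32 apart
```

Because the lower bound of the distance is $k-1$, there are always enough free positions between the endpoints to place the remaining $k-2$ frames.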

Experiments

Datasets. We test our model on 5 benchmarks: FaceForensics $256^{2}$, SkyTimelapse $256^{2}$, UCF101 $256^{2}$, RainbowJelly $256^{2}$ (introduced by us and described in Appx E) and MEAD $1024^{2}$. We used the train splits (when available) for all the datasets except for UCF101, where we used train+test splits. We provide the dataset details in Appx E.

Evaluation. Following prior work, we use Frechet Video Distance (FVD) and Inception Score (IS) as our evaluation metrics, with FVD being the main one since FID (its image-based counterpart) better aligns with human-perceived quality. We use two versions of FVD: FVD16 and FVD128, which use 16- and 128-frame-long videos to compute the statistics. Inception Score is used only to evaluate the generation quality on UCF-101 since it uses a UCF-101-finetuned C3D model.

The official FVD implementation does not provide a complete evaluation pipeline, but rather an inference script for a single batch of videos, which are required to be already resized to $256^{2}$ and loaded into memory. This creates discrepancies in the evaluation protocols used by previous works since FVD (similar to FID) is very sensitive to the subsampling and data processing procedures. We implement, document (see Appx C) and release a complete FVD evaluation protocol and use it to evaluate all the methods.

Baselines. We use 5 baselines for comparison: MoCoGAN, MoCoGAN with the StyleGAN2 backbone, VideoGPT, MoCoGAN-HD and DIGAN. For MoCoGAN with the StyleGAN2 backbone (denoted as MoCoGAN+SG2), we replaced its generator and image-based discriminator with the corresponding StyleGAN2 components, leaving its video discriminator unchanged. We also used the training scheme and regularizations from StyleGAN2. MoCoGAN was trained for 5 days on a single GPU since its lightweight DC-GAN backbone makes it fast to train, while MoCoGAN+SG2 was trained for 2 days on $\times 4$ GPUs to reach 25M real images seen by its image-based discriminator. MoCoGAN-HD was trained for ${\approx}4.5$ days on $\times 4$ V100 GPUs, as specified in Appx B of the original paper. We trained VideoGPT for the maximum affordable total time of 32 GPU-days within our resource constraints. DIGAN was trained for ${\approx}4$ days since after that its FVD score either did not change or exploded (on RainbowJelly). We also replaced its weighted sampling strategy (selecting clips from longer videos with higher probabilities) with the uniform one, which is used by other methods. For each method, we used the checkpoint with the lowest FVD16 value.

For the main evaluation, we train our method and all the baselines from scratch on the described $256^{2}$ datasets. Each model is trained on $\times 4$ NVidia V100 32 GB GPUs, except for VideoGPT, which is very demanding in terms of GPU memory at $256^{2}$ resolution, so we had to train it on $\times 4$ NVidia A6000 instead (with an overall batch size of 4). For our method and MoCoGAN+SG2, we use exactly the same optimization scheme as StyleGAN2, including the loss function, Adam optimizer hyperparameters and R1 regularization. We reduce the learning rate by a factor of 10 for the $D_{V}$ module of MoCoGAN+SG2 since it does not have an equalized learning rate. We use $\delta^{z}=16$ for all the experiments except for SkyTimelapse, where we used $\delta^{z}=256$. See other training details in Appx B. We evaluate all the methods under the same evaluation protocol, described in Appx C, and report the results in Table 1.

To measure the efficiency, we use the number of GPU-days required to train a method. We build on top of the official StyleGAN2 implementation (https://github.com/NVlabs/stylegan2-ada-pytorch). The training cost of the image-based StyleGAN2 to reach its specified 25M images is $7.72$ NVidia V100 GPU-days in our environment. StyleGAN-V is trained for 2 days, which corresponds to ${\approx}23$M real frames seen by the discriminator. MoCoGAN-HD is built on top of the stylegan2-pytorch codebase (https://github.com/rosinality/stylegan2-pytorch), which is ${\approx}2$ times slower than NVidia’s highly optimized implementation. That’s why in Table 1 we report its training cost reduced by a factor of 2 to account for this.

Our method significantly outperforms the existing ones on almost all the benchmarks in terms of FVD16 and FVD128. We visualize the samples in Fig 1 and Fig 7. Our method is able to generate hour-long plausible-looking videos, though their motion diversity and global motion coherence would be limited (see Appx A). MoCoGAN-HD suffers from LSTM instability when unrolled to large lengths and does not produce diverse motions. DIGAN produces high-quality videos on SkyTimelapse because its inductive bias of having joint spatio-temporal positional information is well suited for videos that have an entire scene moving. But for FaceForensics, this leads to a “head flying away” effect (see Appx H). To generate 1-hour-long videos from MoCoGAN-HD, we unroll its LSTM model to the required depth (${\approx}90$k steps) and synthesize frames only in the necessary time positions, while DIGAN, similar to our method, is able to generate frames non-autoregressively.

4.2 Ablations

To ablate the core components, we replaced the $\mathsf{G}$ or $\mathsf{D}$ modules with their MoCoGAN+SG2 counterparts. In both cases, this leads to poor short-term and long-term video quality, as specified by the corresponding metrics in Table 2 and video samples in the supplementary.

Replacing continuous motion codes $\bm{v}(t)$ with $\bm{u}(t)$, produced by an LSTM, hurts the performance, especially when the distance $\delta^{z}$ between motion codes is small. This happens due to unnaturally abrupt transitions between frames and we provide the corresponding samples in the supplementary. The corresponding results are in Table 2.

We also verify the importance of the conditioning in $\mathsf{D}$ and denote the experiment where it’s disabled as “w/o time conditioning” in Table 2. Removing the time conditioning hurts the performance, because it constrains the ability of $\mathsf{D}$ to understand the temporal scale it is currently operating on.

An important design choice is how many samples per video one should use during training. We try $k=2,3,4,8$ and $16$ and report the corresponding results in Table 3. As discussed in §3.3, for existing video generation benchmarks, it might be enough to sample only several frames per video, and our experiments confirm this observation. The performance decreases for larger $k$, but this might be attributed to the weaker temporal aggregation procedure of $\mathsf{D}$, which simply concatenates features together. It is surprising to see that modern datasets can be fit with as few as 2 samples per video.

4.3 Properties

Our generator is able to generate arbitrarily long videos. Our design of motion codes allows StyleGAN-V not to suffer from stability problems when unrolled to large (potentially infinite) video lengths. This is verified by visualizing the video clips for the extremely large timesteps in Fig 1 and Fig 8. We also demonstrate its ability to produce videos in arbitrarily high frame-rate in the supplementary.

Our model has the same latent space manipulation properties as StyleGAN2. To show this, we conduct two experiments: embedding, editing and animating an off-the-shelf image, and editing and animating the first frame of a generated video. To embed an image, we used an optimization procedure similar to the one from StyleGAN2, but considering the image to be positioned at $t=0$. To edit an image with CLIP, we used the procedure of StyleCLIP. The results of these experiments are visualized in Fig 2 and we provide the details in Appx B and more examples in the supplementary. Apart from showing the good properties of its latent space, these experiments demonstrate the extrapolation potential of our generator.

StyleGAN-V has almost the same training efficiency and image quality as StyleGAN2. In Fig 3, we plot the FID scores (computed from 16-frame videos) and training costs of modern video generators on FaceForensics $256^{2}$ against their corresponding FVD16 scores. Our method comes very close to StyleGAN2: it converges to an FID of 9.44 in 8 GPU-days compared to an FID of 8.42 in 7.72 GPU-days for StyleGAN2, which is only ${\approx}10\%$ worse. This raises the question of whether video generators can be as computationally efficient and as good in terms of image quality as image ones.

Our model is the first one which is directly trainable at $1024^{2}$ resolution. We provide the generations on MEAD $1024^{2}$ for our method and for MoCoGAN-HD. MoCoGAN-HD cannot preserve the identity of a speaker and diverges for large video lengths, while our method achieves comparable image quality and coherent motions. For this dataset, our model was trained for 7 days on $\times 4$ NVidia V100 GPUs and obtained an FID of 24.12 and an FVD16 of 156.1. The image generator for MoCoGAN-HD was trained for $14$ days on $\times 4$ A6000 GPUs, while its video generator was trained for only $5$ days since it didn’t require high-resolution training.

Our discriminator provides a more informative learning signal to $\mathsf{G}$. Fig 4 visualizes the gradient signal to the generator from our discriminator and the conv3d-based video discriminator of MoCoGAN-HD, measured at ${\approx}50\%$ of training for our method (at 10M images seen by $\mathsf{D}$) and MoCoGAN-HD (at the 300-th epoch). In our case, one can easily see fine-grained details of the face structure, perceived by $\mathsf{D}$, while in the case of MoCoGAN-HD, most of the gradient is redundant and lacks any structural information.

Content and motion decomposition. Similar to MoCoGAN, our generator captures content and motion variations in a disentangled manner: altering motion codes $\bm{z}^{\textsf{m}}_{t_{0}},...,\bm{z}^{\textsf{m}}_{t_{n}}$ while fixing $\bm{z}^{\textsf{c}}$ does not change the appearance (like a speaker’s identity). Similarly, re-sampling $\bm{z}^{\textsf{c}}$ does not influence the motion patterns of a video, but only its content. We provide the corresponding visualizations on the project website.

Conclusion

In this work, we provided a different perspective on time for video synthesis and built a continuous video generator using the paradigm of neural representations. For this, we developed motion representations through the lens of positional embeddings, explored sparse training of video generators and redesigned a typical dual structure of a video discriminator. Our model is built on top of StyleGAN2 and features a lot of its perks, like efficient training, good image quality and editable latent space. We hope that our work would serve as a solid basis for building more powerful video generators in the future. The limitations and potential negative impact are discussed in Appx A.

Appendix A Limitations and potential negative impact

Limitations of sparse training. In general, sparse training makes it impossible for $\mathsf{D}$ to capture complex dependencies between frames. But surprisingly, it provides state-of-the-art results on modern datasets, which (using the statement from §3.3) implies that they are not that sophisticated in terms of motion.

Dataset-induced limitations. Similar to other machine learning models, our method is bound by the quality of the dataset it is trained on. For example, for the FaceForensics $256^{2}$ dataset, our embedding and manipulation results are inferior to those of StyleGAN2. This is due to the limited number of identities (just 700) in FaceForensics and their larger diversity in terms of quality compared to FFHQ, which StyleGAN2 was trained on.

Periodicity artifacts. $\mathsf{G}$ still produces periodic motions sometimes, despite our acyclic positional embeddings. Further investigation of this phenomenon is needed.

Poor handling of new content appearing. We noticed that our generator tries to reuse the content information encoded in the global latent code as much as possible. It is noticeable on datasets where new content appears during a video, like Sky Timelapse or Rainbow Jelly. We believe it can be resolved using ideas similar to ALIS .

Sensitivity to hyperparameters. We found our generator to be sensitive to the minimal initial period length $\max_{i}\sigma_{i}$ (see Appx B). We increased it for SkyTimelapse from 16 to 256: otherwise the generated videos contained unnatural sharp transitions.

We plan to address those limitations in our future works.

A.2 Potential negative impact

The potential negative impact of our method is similar to that of traditional image-based GANs: creating “deepfakes” (https://en.wikipedia.org/wiki/Deepfake) and using them for malicious purposes. Our model makes it much easier to train a model which produces much more realistic video samples with a small amount of computational resources. But since the availability of high-quality datasets is very low for video synthesis, the resulting model will fall short compared to its image-based counterpart, which could use rich, extremely high-quality image datasets for training, like FFHQ.

Appendix B Implementation and training details

Note that all the details can be found in the source code: https://github.com/universome/stylegan-v.

Our model is built on top of the official StyleGAN2-ADA repository (https://github.com/nvlabs/stylegan2-ada). In this work, we build a model to generate continuous videos, and a reasonable question to ask is why not use INR-GAN instead (like DIGAN) to have fully continuous signals. The reason why we chose StyleGAN2 instead of INR-GAN is that StyleGAN2 is amenable to mixed-precision training, which makes it train ${\approx}2$ times faster. For INR-GAN, enabling mixed precision severely decreases the quality, and we hypothesize that the reason for it is that each pixel in INR-GAN’s activations tensor carries more information (due to the spatial independence) since the model cannot spatially distribute information anymore. Explicitly restricting the range of possible values adds a strict upper bound on the amount of information each pixel is able to carry. We also found that adding coordinate information does not improve video quality for our generator, either qualitatively or in terms of scores.

Similar to StyleGAN2, we utilize the non-saturating loss and $R_{1}$ regularization with a loss coefficient of 0.2 in all the experiments, which is inherited from the original repo; we didn’t try any hyperparameter search for it. We also use the fmaps parameter of 0.5 (the original StyleGAN2 used an fmaps parameter of 1.0), which controls the channel dimensionalities in $\mathsf{G}$ and $\mathsf{D}$, since it is the default setting of StyleGAN2-ADA for $256^{2}$ resolution. This allowed us to further speed up training.

The dimensionalities of $\bm{w},\bm{z},\bm{u}_{t},\bm{v}_{t}$ are all set to 512.

As stated in the main text, we use a padding-less conv1d-based motion mapping network $\mathsf{F}_{\text{m}}$ with a large kernel size to generate raw motion codes $\bm{u}_{t}$. In all the experiments, we use a kernel size of $11$ and a stride of $1$. We do not use any dilation in it despite the fact that it could increase the temporal receptive field: we found that varying the kernel size didn’t produce much benefit in terms of video quality. Using padding-less convolutions allows the model to be stable when unrolled to large depths. We use 2 layers of such convolutions with a hidden size of 512. Another benefit of using conv1d-based blocks is that, in contrast to LSTM/GRU cells, one can practically incorporate the equalized learning rate scheme into them.

Using a conv1d-based motion mapping network without paddings forces us to use “previous” motion noise codes $\bm{z}^{\textsf{m}}_{t}$. That’s why instead of sampling a sequence $\bm{z}^{\textsf{m}}_{t_{0}},...,\bm{z}^{\textsf{m}}_{t_{n}}$, we sample a slightly larger one to adjust for the reduced sequence size. With the same-padding strategy, for sampling a frame at position $t\in[t_{n-1},t_{n})$, we would need to produce $n$ motion noise codes $\bm{z}^{\textsf{m}}$. But with our kernel size of 11, with 2 layers of convolutions and without padding, the required sequence size is $n+20$.
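The $n+20$ figure follows from the standard output-length formula for padding-less convolutions: each layer with kernel size 11 and stride 1 shortens the sequence by 10 elements. The small check below illustrates this arithmetic; the helper names are hypothetical and are not taken from the released code.

```python
def valid_conv1d_out_len(in_len, kernel_size=11, stride=1, num_layers=2):
    """Output length after stacking padding-less ("valid") 1D convolutions."""
    length = in_len
    for _ in range(num_layers):
        if length < kernel_size:
            raise ValueError("sequence is shorter than the kernel")
        length = (length - kernel_size) // stride + 1
    return length


def required_noise_codes(n, kernel_size=11, num_layers=2):
    """Input length needed to obtain n motion codes (stride 1 only)."""
    return n + num_layers * (kernel_size - 1)


n = 5
needed = required_noise_codes(n)           # n + 20 input noise codes
produced = valid_conv1d_out_len(needed)    # back to n motion codes
```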

The training performance of VideoGPT on UCF101 is surprisingly low despite the fact that it was developed for this kind of dataset. We hypothesize that this happens due to UCF101 being a very difficult dataset and VideoGPT being trained with a batch size of 4 (a higher batch size didn’t fit our 200 GB GPU memory setup), which damaged its ability to learn the distribution.

To train our model, we also utilized adaptive differentiable augmentations of StyleGAN2-ADA , but we found it important to make them video-consistent, i.e. applying the same augmentation for each frame of a video. Otherwise, the discriminator starts to underperform, and the overall quality decreases. We use the default bgc augmentations pipe from StyleGAN2-ADA, which includes horizontal flips, 90 degrees rotations, scaling, horizontal/vertical translations, hue/saturation/brightness/contrast changes and luma axis flipping.
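The video-consistency requirement can be illustrated with a toy example that uses only a horizontal flip. This is a simplified sketch and not the StyleGAN2-ADA pipeline: frames are plain nested lists, and `hflip` and `augment_video_consistently` are hypothetical helper names. The key point is that the random decision is sampled once per clip and reused for every frame.

```python
import random


def hflip(frame):
    """Horizontally flip a frame given as a list of pixel rows."""
    return [row[::-1] for row in frame]


def augment_video_consistently(video, p_flip=0.5, rng=random):
    """Sample the augmentation once and apply it to every frame of the clip."""
    do_flip = rng.random() < p_flip  # one decision per video, not per frame
    return [hflip(frame) if do_flip else frame for frame in video]


clip = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 frames of size 2x2
flipped = augment_video_consistently(clip, p_flip=1.0)
unchanged = augment_video_consistently(clip, p_flip=0.0)
```

Sampling the decision inside the per-frame loop instead would flip frames independently, which produces clips that no real video could contain.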

While training the model, for real videos we first select a video index and then select a random clip (i.e., a clip with a random offset). This differs from the traditional DIGAN or VideoGPT training scheme, which is why we needed to change their data loaders to make them learn the same statistics and not get biased by very long videos.
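A minimal sketch of this two-stage sampling is given below. It assumes that videos are described only by their lengths in frames and that videos shorter than the clip are skipped; `sample_clip` is a hypothetical helper name rather than a function from the released data loaders.

```python
import random


def sample_clip(video_lengths, clip_len, rng=random):
    """Pick a video uniformly, then a uniformly random clip offset inside it."""
    eligible = [i for i, n in enumerate(video_lengths) if n >= clip_len]
    if not eligible:
        raise ValueError("no video is long enough for the requested clip")
    # Stage 1: every eligible video is equally likely, regardless of its length.
    video_idx = rng.choice(eligible)
    # Stage 2: a random offset such that the clip fits inside the chosen video.
    offset = rng.randint(0, video_lengths[video_idx] - clip_len)
    return video_idx, offset


lengths = [10, 1000, 5]  # frames per video; the last one is too short for 8 frames
video_idx, offset = sample_clip(lengths, clip_len=8)
```

Under this scheme the 10-frame and the 1000-frame videos are selected equally often, whereas sampling clips uniformly over all possible offsets would pick the long video almost every time.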

To develop this project, ${\approx}7.5$ NVidia V100 32GB GPU-years and ${\approx}0.3$ NVidia A6000 GPU-years were spent.

B.2 Projection and editing procedures

In this subsection, we describe the embedding and editing procedures, which were used to obtain results in Fig 2.

Projection. To project an existing photograph into the latent space of $\mathsf{G}$, we used a procedure from StyleGAN2, but projecting into the $\mathcal{W}+$ space instead of $\mathcal{W}$, since it produces better reconstruction results and does not spoil editing properties. We set the initial learning rate to $0.1$ and optimized a $\bm{w}$ code for the LPIPS reconstruction loss for 1000 steps using Adam. For motion codes, we initialized a static sequence and kept it fixed during the optimization process. We noticed that when it is also being optimized, the reconstruction becomes almost perfect, but it breaks when another sequence of motion codes is plugged in.

Editing. Our CLIP editing procedure is very similar to the one in StyleCLIP, with the exception that we embed an image assuming that it is a video frame at location $t=0$. On each iteration, we resample motion codes since all our edits are semantic and do not refer to motion. We leave motion editing with CLIP for future exploration. For the sky editing video presented in Fig 2, we additionally utilize masking: we initialize a mask to cover the trees and try not to change them during the optimization using the LPIPS loss. For all the videos presented on the supplementary website, no masking is used.

The details can be found in the provided source code.

B.3 Additional details on positional embeddings

Mitigating high-frequency artifacts. We noticed that if our periods $\bm{\omega}_{t}$ are left unbounded, they might grow to very large values (up to a magnitude of ${\approx}20.0$), which corresponds to extra high frequencies (the period length becomes less than 4 frames) and leads to temporal aliasing. That’s why we process them via the $\text{tanh}(\bm{\omega}_{t})+1$ transform: this bounds them into the $(0,2)$ range with a mean of 1.0, i.e. using the at-initialization frequency scaling, which we discuss next.
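The effect of the bounding transform can be checked numerically on scalars. The snippet below is only an illustration of the $\text{tanh}(\cdot)+1$ mapping itself; `bound_period_scale` is a hypothetical name and the function is not part of the released code.

```python
import math


def bound_period_scale(omega):
    """Map an unbounded value into the (0, 2) range, with 0 mapped to 1."""
    return math.tanh(omega) + 1.0


at_init = bound_period_scale(0.0)    # a zero pre-activation keeps the scale at 1
large = bound_period_scale(3.0)      # large positive values approach 2
small = bound_period_scale(-3.0)     # large negative values approach 0
```

In floating-point arithmetic the output saturates at the interval ends for very large magnitudes, so a value of magnitude 20 is effectively clipped.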

Linearly spaced periods. An important design decision is the scaling of periods since at initialization it should cover both high-frequency and low-frequency details. Existing works use either exponential scaling $\bm{\sigma}=(2\pi/2^{d},2\pi/2^{d-1},...)$ or random scaling $\bm{\sigma}\sim\mathcal{N}(0,\xi\bm{I})$. In practice, we scale the $i$-th column of the amplitudes weight matrix with the value:

where we use $\omega_{\text{max}}=2^{10}$ frames and $\omega_{\text{min}}=2^{3}$ frames in all the experiments, except for SkyTimelapse, for which we use $\omega_{\text{min}}=2^{8}$. We call this scheme linear scaling and use it as an additional tool to alleviate periodicity since it greatly increases the overall cycle of a positional embedding (see Fig 9). See also the accompanying source code for details.
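The claim about the overall cycle can be illustrated with integer periods: a set of waves with integer periods repeats jointly after the least common multiple of those periods. The example below compares power-of-two periods with evenly spaced ones over the same range of 8 to 1024 frames. The concrete period values are chosen for illustration only and are not the ones used in our experiments.

```python
import math
from functools import reduce


def overall_cycle(periods):
    """Smallest shift after which all integer-period waves repeat at once (LCM)."""
    return reduce(lambda a, b: a * b // math.gcd(a, b), periods)


# Exponentially spaced periods: 8, 16, ..., 1024 frames.
exponential_periods = [2 ** p for p in range(3, 11)]
# Evenly (linearly) spaced periods over the same range: 8, 135, ..., 1024 frames.
linear_periods = [8 + 127 * i for i in range(9)]

exp_cycle = overall_cycle(exponential_periods)
lin_cycle = overall_cycle(linear_periods)
```

With power-of-two periods every shorter period divides the longest one, so the embedding repeats after the longest period, while evenly spaced periods share few common factors and the joint cycle becomes orders of magnitude longer.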

Another benefit of using our positional embeddings over LSTM is that they are “always stable”, i.e. they are always in a suitable range.

Appendix C Evaluation details

For the practical implementation, see the provided source code: https://github.com/universome/stylegan-v.

In this section, we describe the difficulties of a fair comparison of FVD scores. There are discrepancies between papers even in computing FID. So it is less surprising that FVD computation diverges even more between papers and has even more implications for the evaluation of methods.

First, we note that the I3D model has different weights on tf.hub (https://tfhub.dev/deepmind/i3d-kinetics-400/1), which is the model used in the official FVD repo (https://github.com/google-research/google-research/blob/master/frechet_video_distance), compared to its official release in the official GitHub repo (https://github.com/deepmind/kinetics-i3d). That’s why we manually exported the weights from tf.hub and used this GitHub repo (https://github.com/hassony2/kinetics_i3d_pytorch) to obtain an exact implementation in PyTorch.

There are several issues with the FVD metric on its own. First, it does not capture motion collapse, which can be observed by comparing FVD16 and FVD128 scores between StyleGAN-V and StyleGAN-V with LSTM motion codes instead of ours: the latter has a severe motion collapse issue (see the samples on our website) and yet has similar or lower FVD128 scores compared to our model: 196.1 or 165.8 (depending on the distance between anchors) vs 197.0 for our model. Another issue with FVD is that it is biased towards image quality. If one trains a good image generator, i.e. a model which is not able to generate any videos at all, then its FVD will still be good despite its degenerate motion.

We also want to make a note on how we compute FID for video generators. For this, we generate 2048 videos of 16 frames each (starting with $t=0$) and use all those frames in the FID computation. This gives ${\approx}33$k images to construct the dataset, but those images will have lower diversity compared to a typically utilized 50k-sized set of images from a traditional image generator. The reason is that the 16 images in a single clip likely share a lot of content. A better strategy would be to generate 50k videos and pick a random frame from each video, but this is too heavy computationally for models which produce frames autoregressively. And using just the first frame in the FID computation would unfairly favour MoCoGAN-HD, which generates the very first frame of each video with a frozen StyleGAN2 model.

FVD is greatly influenced by 1) how many clips per video are selected; 2) with which offsets; and 3) at which frame-rate. For example, SkyTimelapse contains several extremely long videos: if we select as many clips as possible from each real video, then it will severely bias the statistics of FVD. For FaceForensics, videos often contain intro frames during their first ${\approx}0.5$-$1.0$ seconds, which will affect FVD when a constant offset is chosen to extract a single clip per video.

That’s why we use the following protocol to compute FVDn.

Computing real statistics. To compute real statistics, we select a single clip per video, chosen at a random offset. We use the actual frame-rate of the dataset which the model is being trained on, without skipping any frames. The problem with such an approach is that datasets with a small number of long videos (like FaceForensics, see Table 7) might have noisy estimates. But our results showed that the standard deviations are always $<3.0$ even for FaceForensics $256^{2}$. The largest standard deviation we observed was when computing FVD16 on RainbowJelly: on this dataset it was $26.15$ for VideoGPT, but it is $<1\%$ of its overall magnitude.
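The clip selection for real statistics can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `sample_real_clip` is hypothetical, and skipping videos shorter than the clip length is an assumption.

```python
import random

def sample_real_clip(num_frames, clip_len, rng):
    """Return frame indices of one clip at a random offset, or None if the video is too short."""
    if num_frames < clip_len:
        return None  # assumption: videos shorter than the clip are skipped
    offset = rng.randint(0, num_frames - clip_len)  # inclusive on both ends
    return list(range(offset, offset + clip_len))   # consecutive frames, no skipping

rng = random.Random(0)
video_lengths = [300, 45, 16, 10]  # hypothetical frame counts of real videos
clips = [sample_real_clip(n, clip_len=16, rng=rng) for n in video_lengths]
```

Because each video contributes exactly one clip, extremely long videos do not dominate the statistics.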

Computing fake statistics. To compute fake statistics, we generate 2048 videos and save them as frames in JPEG format via the Pillow library. We use the quality parameter $q=95$ for doing this, since it was shown to have very close quality to PNG, but without introducing artifacts that would lead to discrepancies. Ideally, one would like to store frames in the PNG format, but in this case it would be too expensive to represent video datasets: for example, MEAD $1024^{2}$ would occupy ${\approx}0.5$ terabytes of space in this case.

We illustrate the subtleties of FVD computation in Table 4. For this, we compute real/fake statistics for our model in several different ways:

Resized to $128^{2}$. Both fake and real statistics images are resized to $128^{2}$ resolution via the PyTorch bilinear interpolation (without corners alignment) before computing FVD.

JPG/PNG discrepancy. Instead of saving fake frames in JPG with $q=95$, we use the $q=75$ parameter in the PIL library. This creates more JPEG-like artifacts, to which FID, for example, is very sensitive.

Using all clips per video. We use all available $n$-frames-long clips in each video without overlaps. Note, that our model was trained

Using only first frames. In each real video, instead of using random offsets to select clips, we use the first $n$ frames.

Using $s=8$ subsampling. When sampling frames for computing real/fake statistics, we select each $8$-th frame. This is the strategy which was employed for some of the experiments in the original paper, but in their case the authors trained the model on videos with this subsampling.
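The clip-selection variants listed above differ only in which start offsets are used and in the stride between frames. A minimal sketch, with hypothetical function names and a `mode` argument introduced purely for illustration:

```python
def clip_start_offsets(num_frames, n, mode, stride=1):
    """Start offsets of n-frame clips in a video for the "first" and "all" variants."""
    span = (n - 1) * stride + 1  # number of raw frames covered by one clip
    if num_frames < span:
        return []                # video too short for even one clip
    if mode == "first":
        return [0]               # only the first n frames
    if mode == "all":
        # every available clip, without overlaps
        return list(range(0, num_frames - span + 1, span))
    raise ValueError(f"unknown mode: {mode}")

def clip_frames(offset, n, stride=1):
    """Frame indices of one clip; stride=8 selects each 8-th frame."""
    return [offset + i * stride for i in range(n)]

print(clip_start_offsets(50, 16, "all"))   # non-overlapping 16-frame clips
print(clip_frames(0, 16, stride=8)[-1])    # last frame index of a subsampled clip
```

Note that a 16-frame clip with a stride of 8 spans 121 raw frames, so short videos cannot contribute any clip under this variant.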

For completeness, we also provide the Inception Score on the UCF-101 $256^{2}$ dataset in Table 5. Note that it is computed by resizing all videos to $112\times 112$ spatial resolution (due to the internal structure of the C3D model), which makes it impossible for it to capture the high-resolution details of the generated videos, which are the focus of the current work.

In Tab 6, we provide the numbers used in Fig 3. Note that StyleGAN2 training in our case is slightly slower than the officially specified one (7.3 vs 7.7 GPU days; see https://github.com/NVlabs/stylegan2-ada-pytorch), which we attribute to a slightly slower file system on our computational cluster.

Appendix D Failed experiments

In this section, we provide a list of ideas, which we tried to make work, but they didn’t work either because the idea itself is not good, or because we didn’t put enough experimental effort into investigating it.

Hierarchical motion codes. We tried having several layers of motion codes. Each layer has its own distance between the codes. In this way, high-level codes should capture high-level motion and bottom-level codes should represent short local motion patterns. This didn’t improve the scores and didn’t provide any disentanglement of motion information. We believe that the motion should be represented differently (similar to FOMM), rather than with motion codes, because they make it difficult for $\mathsf{G}$ to make them temporally coherent.

Maximizing entropy of motion codes to alleviate motion collapse. As an additional tool to alleviate motion collapse, we tried to maximize the entropy of the wave parameters of our motion codes. The generator solved the task of maximizing the entropy well, but it didn’t affect the motion collapse: it managed to save some coordination dimensions of $\bm{v}_{t}$ specifically to synchronize motions.

Progressive growing of frequencies in positional embeddings. We tried starting with low frequencies first and progressively opening new ones during training. It is a popular strategy for training implicit neural representations on reconstruction tasks, but in our case we found the following problem with it. The generator learned to use low frequencies for representing high-frequency motion and didn’t learn to utilize high frequencies for this task when they became available. That’s why high-frequency motion patterns (like blinking or speaking) were unnaturally slow.

Continuous LSTM with EMA states. Our motion codes use sine/cosine activations, which makes them suffer from periodic artifacts (those artifacts are mitigated by our parametrization, but are still present sometimes). We tried to use an LSTM instead, but with an exponential moving average on top of its hidden states to smooth out motion representations temporally. However (likely due to the lack of experimental effort which we invested into this direction), the resulting motion representations were either too smooth or too sharp (depending on the EMA window size), which resulted in unnatural motions.
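The smoothing step can be sketched as below. This is only an illustration of an exponential moving average over a sequence of state vectors: the LSTM itself is omitted, the function name `ema_smooth` is hypothetical, and parametrizing the smoothing by a `decay` factor (rather than a window size) is an assumption.

```python
def ema_smooth(states, decay):
    """Exponential moving average over a sequence of hidden-state vectors."""
    smoothed = []
    running = None
    for state in states:
        if running is None:
            running = list(state)  # initialize with the first state
        else:
            running = [decay * r + (1.0 - decay) * s for r, s in zip(running, state)]
        smoothed.append(list(running))
    return smoothed

# A step change in a 1-dimensional hidden state is spread over several frames.
hidden = [[0.0], [1.0], [1.0], [1.0]]
print(ema_smooth(hidden, decay=0.5))
```

A decay close to 1 (a long window) gives over-smoothed trajectories, while a decay close to 0 leaves the states nearly unchanged, which mirrors the "too smooth or too sharp" trade-off described above.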

Concatenating spatial coordinates. INR-GAN uses spatial positional embeddings and shows that they provide a better geometric prior to the model. We tried to use them as well in our experiments, but they didn’t provide any improvement, either qualitatively or quantitatively, and made the training slightly slower (by ${\approx}10\%$) due to the increased channel dimensionalities.

Feature differences in $\mathsf{D}$. Another experimental direction which we tried is computing differences between activations of next/previous frames in a video and concatenating this information back to the activations tensor. The intuition was to provide $\mathsf{D}$ with some sort of “latent” optical flow information. However, it made $\mathsf{D}$ too powerful (its loss became smaller than usual) and it started to outpace $\mathsf{G}$ too much, which decreased the final scores.

Predicting $\delta^{x}$ instead of conditioning in $\mathsf{D}$. There are two ways to utilize the time information in $\mathsf{D}$: as a conditioning signal or as a learning signal. For the latter, we tried to predict the time distances between frames by training an additional head to predict the class (we treated the problem as classification instead of regression since there is a very limited number of time distances between frames which $\mathsf{D}$ sees during its training). However, it noticeably decreased the scores.

Conditioning on video length. For unconditional UCF-101, it might be very important for $\mathsf{G}$ to know the video length in advance, because some classes might contain very short clips (like jumping), while others are very long, and it might be useful for $\mathsf{G}$ to know in advance which video it will need to generate (since we sample frames at random time locations during training). However, utilizing this conditioning didn’t influence the scores.

Appendix E Datasets details

We provide the dataset statistics in Fig 10 and their comparison in Table 7. Note, that for MEAD, we use only its front camera shots (originally, it releases shots from several camera positions).

E.2 Rainbow Jelly

We noticed that modern video synthesis datasets are either too simple or too difficult in terms of content and motion, and there are no datasets “in-between”. That’s why we introduce RainbowJelly: a dataset of “floating” jellyfish. It is constructed from an 8-hour-long movie in 4K resolution and 30 FPS from the Hoccori Japan YouTube channel. It contains simple content but complex hierarchical motions, which makes it a challenging but approachable test-bed for evaluating modern video generators.

For our RainbowJelly benchmark, we used the following film: https://www.youtube.com/watch?v=P8Bit37hlsQ. We cannot release this dataset due to copyright restrictions, but we released a full script which processes it (see the provided source code). To construct the benchmark, we sliced it into 1686 chunks of 512 frames each, starting with the 150-th frame (to remove the loading screen), then center-cropped and resized the frames to $256^{2}$ resolution. This benchmark is advantageous compared to the existing ones in the following way:

It contains complex hierarchical motions:

a jellyfish flowing in a particular direction (low-frequency global motion);

a jellyfish pushing water with its arms (medium-frequency motion);

small perturbations of jellyfish’s body and tentacles (high-frequency local motion).

It is a very high-quality dataset (4K resolution).

It is simple in terms of content, which makes the benchmark more focused on motions.
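The slicing of the source film into chunks described above can be sketched as follows. This is not the released processing script: the function name `chunk_bounds` is hypothetical, and treating "the 150-th frame" as a zero-based index is an assumption.

```python
def chunk_bounds(first_frame, chunk_len, num_chunks):
    """Return (start, end) frame ranges, end exclusive, for consecutive chunks."""
    return [
        (first_frame + i * chunk_len, first_frame + (i + 1) * chunk_len)
        for i in range(num_chunks)
    ]

bounds = chunk_bounds(first_frame=150, chunk_len=512, num_chunks=1686)
max_frames = 8 * 60 * 60 * 30  # upper bound on the film length: 8 hours at 30 FPS

print(bounds[0], bounds[-1], max_frames)
```

The last chunk ends within the frame budget of an 8-hour film at 30 FPS, so the stated chunk count is consistent with the stated duration.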

Appendix F Implicit assumptions of sparse training

In this section, we elaborate on our simple theoretical exposition from §3.3.

Consider that we want to fit a probabilistic model $q_{\theta}(\bm{x})$ to the real data distribution $\bm{x}\sim p(\bm{x})=p(x_{1},...,x_{n})$. For simplicity, we will be considering a discrete finite case, i.e. $n<\infty$, but note that videos, while continuous and infinite in theory, are still discretized and have a time limit to fit on a computer in practice. For fitting the distribution, we use $k$-sparse training, i.e. picking only $k$ random coordinates from each sample $\bm{x}\sim p(\bm{x})$ during the optimization process. In other words, introducing $k$-sparse sampling reformulates the problem from

where $d(\cdot,\cdot)$ is a problem-specific distance function between probability distributions, $\mathcal{I}^{k}$ is a collection of all possible sets $I=\{i_{1},...,i_{k}\}$ of unique indices $i_{j}\in\{1,2,...,n\}$ and $\bm{x}_{I}$ denotes a sub-vector $(x_{i_{1}},...,x_{i_{k}})$ of $\bm{x}$. This means that instead of bridging together the full distributions, we choose to bridge all their possible marginals of length $k$. When will solving Eq. (8) help us to obtain the full joint distribution $p(\bm{x})$? To investigate this question, we develop the following simple statement.
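Written out with the symbols defined above, the two objectives could look as follows. This is a sketch: the full-distribution objective follows directly from the definitions, while aggregating the marginal distances with a plain sum over $\mathcal{I}^{k}$ (rather than, say, an expectation) is an assumption. The second objective corresponds to what the text calls Eq. (8).

```latex
% Full objective: match the joint distribution directly.
\min_{\theta}\; d\big(p(\bm{x}),\, q_{\theta}(\bm{x})\big)

% k-sparse objective (assumed form): match every marginal over k coordinates.
\min_{\theta}\; \sum_{I\in\mathcal{I}^{k}} d\big(p(\bm{x}_{I}),\, q_{\theta}(\bm{x}_{I})\big)
```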

Let’s denote by $\mathcal{J}_{<i}^{k}$ a collection of sets $J_{i}$ of up to $k$ indices s.t. $\forall J_{i}\in\mathcal{J}_{<i}^{k}$ we have $j<i$ for all $j\in J_{i}$.

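Using the chain rule, we can represent $p(\bm{x})$ as:

$$p(\bm{x})=\prod_{i=1}^{n}p(x_{i}|\bm{x}_{<i}).$$

If for each $i$ there exists a set $J_{i}$ from the collection defined above s.t. $p(x_{i}|\bm{x}_{<i})=p(x_{i}|\bm{x}_{J_{i}})$, then $p(\bm{x})$ is obviously simplified to:

$$p(\bm{x})=\prod_{i=1}^{n}p(x_{i}|\bm{x}_{J_{i}}).$$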

But we would also like to have the “reverse” dependency, i.e. knowing that if we can approximate the distribution via a set of marginals, then this distribution is not too difficult. For this claim, we will need to consider marginals not of an arbitrary form $p(\bm{x}_{S})$, but of the form $p(x_{i},\bm{x}_{J_{i}})$, and we would need exactly $n$ of those. The reverse implication is the following. If $p(\bm{x})$ can be represented as a product of $n$ conditionals $p(x_{i}|\bm{x}_{J_{i}})$, then for each $i$ there exists $J_{i}\in\mathcal{J}^{k}_{<i}$ s.t. $p(x_{i}|\bm{x}_{<i})=p(x_{i}|\bm{x}_{J_{i}})$. This statement, just like the previous one, looks obvious. But, oddly, it requires more than a single sentence to prove. First, we are given that:

but unfortunately, we cannot directly claim that each term in the product $\prod_{i=1}^{n}p(x_{i}|\bm{x}_{<i})$ equals its corresponding one in the product $\prod_{i=1}^{n}p(x_{i}|\bm{x}_{J_{i}})$. For this, we first need to show that for each $m$ we have:

This allows us to cancel terms in the chain rule one by one, starting from the end, leading to the desired equality:
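One way to spell out the steps of this argument is given below. It is a sketch reconstructed from the surrounding text, so the exact intermediate statements used by the original derivation may differ; it assumes that every $p(x_{i}|\bm{x}_{J_{i}})$ is a true conditional of $p$ and that the divisions are performed only where $p(\bm{x}_{<m})>0$.

```latex
% Given: the chain-rule product equals the product of sparse conditionals.
\prod_{i=1}^{n} p(x_{i}|\bm{x}_{<i}) = \prod_{i=1}^{n} p(x_{i}|\bm{x}_{J_{i}})

% Sum both sides over x_n. Every J_i contains only indices below i, so x_n
% appears only in the last factor, and a conditional sums to 1 over its argument:
\sum_{x_{n}} \prod_{i=1}^{n} p(x_{i}|\bm{x}_{J_{i}})
  = \prod_{i=1}^{n-1} p(x_{i}|\bm{x}_{J_{i}}) \sum_{x_{n}} p(x_{n}|\bm{x}_{J_{n}})
  = \prod_{i=1}^{n-1} p(x_{i}|\bm{x}_{J_{i}})

% Repeating this for x_{n-1}, ..., x_{m+1} gives, for each m:
\prod_{i=1}^{m} p(x_{i}|\bm{x}_{<i}) = \prod_{i=1}^{m} p(x_{i}|\bm{x}_{J_{i}})

% Dividing the identity for m by the identity for m-1:
p(x_{m}|\bm{x}_{<m}) = p(x_{m}|\bm{x}_{J_{m}})
```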

Does this reverse claim tell us anything useful? Surprisingly again, yes. It implies that if we managed to fit $p(\bm{x})$ by using $k$-sparse training, then this distribution is not sophisticated.

Merging the above two statements together, we see that $p(\bm{x})$ can be represented as a product of $n$ conditionals $p(x_{i}|\bm{x}_{J_{i}})$ for $i=1,...,n$ if and only if for all $i\leqslant n$ there exists $J_{i}\in\mathcal{J}_{<i}^{k-1}$ s.t. $p(x_{i}|\bm{x}_{<i})\equiv p(x_{i}|\bm{x}_{J_{i}})$.

Appendix G Additional samples

For the ease of visualization, we provide additional samples of the model via a web page: https://universome.github.io/stylegan-v.

Appendix H Comparison to DIGAN

Our model shares a lot of similarities to DIGAN and in this section we highlight those similarities and differences.

Sparse training. DIGAN also utilizes very sparse training (only 2 frames per video). But in our case, we additionally explore the optimal number of frames per video $k$ (see §3.3).

Continuous-time generator. DIGAN also builds a generator, which is continuous in time. But our generator does not lose the quality at infinitely large lengths.

Dropping conv3d blocks. DIGAN also drops conv3d blocks in their discriminator. But in contrast to us, they still have 2 discriminators.

H.2 Major differences

Motion representation. DIGAN uses only a single global motion code, which makes it theoretically impossible to generate infinite videos: at some point it will start repeating itself (due to the usage of sine/cosine-based positional embeddings). In our case, we use an infinite sequence of motion codes, which are temporally interpolated, used to compute wave parameters, and transformed into the final motion codes. DIGAN mixes temporal and spatial information together into the same positional embedding, which creates the following problem: when time changes, the spatial location perceived by the model also changes. This creates a “head-flying-away” effect (see the samples). In our case, we keep these two information sources decomposed from one another.

Generator’s backbone. DIGAN is built on top of INR-GAN , while our work uses StyleGAN2. This allows DIGAN to inherit INR-GAN’s benefits from being spatially continuous, but at the expense of being less stable and being slower to train (due to the lack of mixed precision and increased channel dimensionalities from concatenating positional embeddings).

Discriminator structure. DIGAN uses two discriminators: the first one operates on image-level and is equivalent to StyleGAN2’s one, while the other one operates on “video” level and takes frames $\bm{x}_{t_{1}},\bm{x}_{t_{2}}$ and the time difference between them $\Delta=t_{2}-t_{1}$, concatenates them all together into a 7-channel input image (tiling the time difference scalar) and passes it into a model with the StyleGAN2 discriminator’s backbone. In our case, we concatenate the frame features and apply the conditioning via the projection discriminator strategy.

Sampling procedure. We use $k=3$ samples per video, while DIGAN uses $k=2$. Also, we sample frames uniformly at random, while DIGAN selects $t_{1}\sim\text{Beta}(2,1)$ and $t_{2}\sim\text{Beta}(1,2)$ (in this way, DIGAN sometimes has $t_{1}>t_{2}$). Apart from that, they use $T=16$.
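The two sampling schemes can be contrasted with a small sketch based only on the description above, not on either codebase. The function names are hypothetical, and drawing distinct, sorted frame indices in the uniform case is an assumption.

```python
import random

def sample_uniform_times(num_frames, k, rng):
    """Pick k distinct frame indices uniformly at random (sorted for convenience)."""
    return sorted(rng.sample(range(num_frames), k))

def sample_beta_pair(rng):
    """Draw normalized times t1 ~ Beta(2, 1) and t2 ~ Beta(1, 2), both in [0, 1]."""
    return rng.betavariate(2.0, 1.0), rng.betavariate(1.0, 2.0)

rng = random.Random(0)
uniform_times = sample_uniform_times(num_frames=256, k=3, rng=rng)
beta_pairs = [sample_beta_pair(rng) for _ in range(1000)]
reversed_share = sum(t1 > t2 for t1, t2 in beta_pairs) / len(beta_pairs)
print(uniform_times, reversed_share)
```

Since the two Beta draws are independent, nothing enforces an ordering between $t_{1}$ and $t_{2}$, which is why pairs with $t_{1}>t_{2}$ occur.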

Apart from those major distinctions, there are a lot of small implementation differences. We refer the interested reader to the released codebases for them:

StyleGAN-V: https://github.com/universome/stylegan-v

DIGAN: https://openreview.net/forum?id=Czsdv-S4-w9

H.3 A note on the computational cost

INR-GAN demonstrated that it has higher throughput than StyleGAN2 in terms of images/second. But the authors compare to the original StyleGAN2 implementation and not to the one from the StyleGAN2-ADA repo, which is much better optimized. Also, they use caching of positional embeddings, which is only possible at test time and has a great influence on computational performance. In this way, we found that StyleGAN2 is ${\approx}2$ times faster to train and consumes less GPU memory than INR-GAN.

DIGAN is built on top of INR-GAN and that’s why it suffers from the issues described above. We trained it for a week on $\times 4$ NVidia V100 GPUs and observed that it stopped improving after ${\approx}5$ days of training. This is equivalent to ${\approx}20$k real frames seen by the discriminator (while MoCoGAN+SG2 and StyleGAN-V reach ${\approx}25$k in just 2 days for the same resolution in the same environment). At the time of submitting the main paper, there was no information about the training cost. However, the authors updated their manuscript by the time of submitting the supplementary and specify a training cost of 8 GPU-days for $128^{2}$ resolution, which is consistent with our experiments (considering that our resolution is twice as large).