Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, Hongyang Li

Introduction

Driven by scalable learning techniques, autonomous driving has made encouraging strides over the past few years. However, intricate and out-of-distribution situations are still intractable for state-of-the-art techniques. One promising solution lies in world models, which infer the possible future states of the world from historical observations and alternative actions, in turn assessing the feasibility of such actions. They hold the potential to reason with uncertainty and avoid catastrophic errors, thereby promoting generalization and safety in autonomous driving.

Although a primary prospect of world models is to enable generalization to novel environments, existing driving world models are still constrained by data scale and geographical coverage. As summarized in Table 1 and Fig. 1, they are also often confined to low frame rates and resolutions, resulting in a loss of critical details. Furthermore, most models only support a single control modality such as steering angle and speed. This is insufficient to express various action formats ranging from high-level intentions to low-level maneuvers, and is incompatible with the outputs of prevalent planning algorithms. In addition, generalizing action controllability to unseen datasets is understudied. These limitations impede the applicability of existing works, making it imperative to develop a world model that overcomes them.

To this end, we introduce Vista, a driving world model that is proficient in cross-domain generalization, high-fidelity prediction, and multi-modal action controllability. Specifically, we develop the predictive model on a large corpus of worldwide driving videos to foster its generalization ability. To enable coherent future extrapolation, we condition Vista on three essential dynamic priors (Sec. 3.1). Instead of solely relying on the standard diffusion loss, we introduce two explicit loss functions to enhance dynamics and preserve structural details (Sec. 3.1), promoting Vista’s ability to simulate realistic futures at high resolution. For flexible controllability, we incorporate a versatile set of action formats, including both high-level intentions such as commands and goal points, as well as low-level maneuvers like trajectories, steering angles, and speeds. These action conditions are injected via a unified interface, which is learned through an efficient training strategy (Sec. 3.2). Consequently, as Fig. 2 shows, Vista acquires the ability to anticipate realistic futures at 10 Hz and 576 $\times$ 1024 pixels, and obtains versatile action controllability across various levels of granularity. We also demonstrate the potential of Vista as a generalizable reward function to evaluate the reliability of different actions.

Our contributions are three-fold: (1) We present Vista, a generalizable driving world model that can predict realistic futures at high spatiotemporal resolution. Its prediction fidelity is greatly improved by two novel losses that capture dynamics and preserve structures, along with exhaustive dynamic priors to sustain consistency in long-horizon rollouts. (2) Propelled by an efficient learning strategy, we integrate versatile action controllability into Vista through a unified conditioning interface. The action controllability of Vista can also generalize to different domains in a zero-shot manner. (3) We conduct comprehensive experiments across multiple datasets to verify the effectiveness of Vista. It outperforms the most competitive general-purpose video generator and sets a new state-of-the-art on nuScenes. Our empirical evidence shows that Vista can be used as a reward function to assess actions.

Preliminary

Despite its high aesthetic quality, Stable Video Diffusion (SVD) lacks several key properties needed to function as a driving world model. As shown in Sec. 4, the prediction of SVD does not commence from the condition image, making it misaligned at the beginning and impractical for long-term extension. In addition, SVD struggles with the intricate dynamics of driving scenarios, producing implausible motions. Moreover, SVD cannot be controlled by any action format. To this end, we aim to build a generalizable driving world model that predicts high-fidelity futures with realistic dynamics. It ought to be continuously extendable to long horizons and flexibly controllable by multi-modal actions as illustrated in Fig. 2.

Learning a Generalizable Driving World Model

As depicted in Fig. 3, Vista adopts a two-phase training pipeline. First, we build a dedicated predictive model, which involves a latent replacement approach to enable coherent future prediction and two novel losses to enhance fidelity (Sec. 3.1). To ensure the generalization to unseen scenarios, we utilize the largest public driving dataset for training. In the second phase, we incorporate multi-modal actions to learn action controllability with an efficient and collaborative training strategy (Sec. 3.2). Using the ability of Vista, we further introduce a generalizable approach to evaluate actions (Sec. 3.3).

Basic Setup. Since world models are meant to predict futures from the current state, the start of their prediction should be firmly aligned with the condition image. Therefore, we tailor SVD into a dedicated predictive model by imposing the first frame as the condition image and discarding the noise augmentation during training. With this prediction ability, Vista can perform long-term rollouts by iteratively predicting short-term clips and resetting the condition image from the last predicted clip.

Dynamic Prior Injection. Nevertheless, using the aforementioned setup for training often results in irrational dynamics with respect to historical frames, especially in long-term rollouts. We conjecture that this mainly arises from the ambiguity caused by insufficient priors about the tendency of future motions, which is also a common limitation of existing driving world models.

Estimating coherent futures requires at least three essential priors that inherently govern the future motion of instances in the scene: position, velocity, and acceleration. Since velocity and acceleration are the first- and second-order derivatives of position respectively, these priors can be entirely derived by using three consecutive frames for conditioning. Concretely, we build a frame-wise mask $\boldsymbol{m}\in\{0,1\}^{K}$ with a length of $K$ to indicate the presence of condition frames. The mask is set sequentially following the time order, with at most three elements being assigned as $1$ to denote three condition frames. Instead of concatenating additional channels to the inputs, we inject new condition frames by replacing the corresponding noisy latent $n_{i}$ with the clean latent $z_{i}$ encoded by the image encoder. Formally, the input latent is constructed as $\hat{\boldsymbol{n}}=\boldsymbol{m}\cdot\boldsymbol{z}+(1-\boldsymbol{m})\cdot\boldsymbol{n}$ (see Fig. 3 [Left]). To discern the clean latent, we duplicate a new timestep embedding from the pretrained weights and allocate it to the condition frames according to $\boldsymbol{m}$. The timestep embeddings for condition frames and prediction frames are trained separately. Compared to channel-wise concatenation, we find that replacing the latent is more effective and flexible in absorbing varying numbers of condition frames. In addition, we observe that the replaced latent, when applied to SVD directly, does not degrade its generation quality. Thus, the original performance is not disturbed when training is launched. Since there is no need to predict the observed condition frames, we exclude them from the loss as follows:
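The display equation referenced later as Eq. 1 does not survive in this text. A reconstruction from the definitions above ($\boldsymbol{m}$, $\hat{n}_{i}$, $z_{i}$) and the symbols introduced right after it ($D_{\theta}$, $\sigma$) would read as follows; the loss name $\mathcal{L}_{\text{diffusion}}$ and the expectation subscripts are assumptions:

$$\mathcal{L}_{\text{diffusion}}=\mathbb{E}_{\boldsymbol{z},\sigma,\hat{\boldsymbol{n}}}\Big[\sum_{i=1}^{K}(1-m_{i})\,\big\|D_{\theta}(\hat{n}_{i};\sigma)-z_{i}\big\|^{2}\Big],$$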

where $D_{\theta}$ is the UNet denoiser that shares the same architecture with SVD. With the replaced latent holding sufficient priors, Vista can fully capture the status of the surrounding instances and predict more coherent and plausible long-term futures through iterative rollouts. In practice, we leverage the last three frames of a predicted clip as dynamic priors for the next prediction step during rollouts.
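The latent replacement and the rollout hand-off described above can be illustrated with a minimal sketch. It assumes toy latents stored as flat Python lists, and all function names (`build_mask`, `replace_latents`, `masked_squared_error`, `next_rollout_priors`) are hypothetical helpers rather than the authors' implementation.

```python
import random


def build_mask(num_frames, num_cond):
    """Frame-wise mask m in {0,1}^K; the first num_cond frames (at most 3) are conditions."""
    if not 0 <= num_cond <= min(3, num_frames):
        raise ValueError("expected between 0 and 3 condition frames")
    return [1 if i < num_cond else 0 for i in range(num_frames)]


def replace_latents(mask, clean, noisy):
    """n_hat_i = m_i * z_i + (1 - m_i) * n_i, applied frame by frame."""
    return [z if m == 1 else n for m, z, n in zip(mask, clean, noisy)]


def masked_squared_error(mask, pred, target):
    """Squared error summed over prediction frames only; condition frames are excluded."""
    total = 0.0
    for m, p, t in zip(mask, pred, target):
        if m == 0:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total


def next_rollout_priors(predicted_clip, num_cond=3):
    """Last frames of a predicted clip, reused as condition frames for the next step."""
    return predicted_clip[-num_cond:]


rng = random.Random(0)
K, D = 6, 4                                   # toy clip length and latent size
z = [[float(i)] * D for i in range(K)]        # stand-in for clean latents
n = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]  # noisy latents
m = build_mask(K, 3)
n_hat = replace_latents(m, z, n)
priors = next_rollout_priors(z)
```

Because the mask only selects which frames are swapped, the same routine handles one, two, or three condition frames without changing the input shape, which is the flexibility the text attributes to replacement over channel-wise concatenation.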

Dynamics Enhancement Loss. Unlike general videos that cover rather small spaces, driving videos encompass much larger scenes. In most driving videos, distant and monotonous regions dominate the view, with the moving instances in the foreground only occupying a relatively small area. However, the latter often exhibit higher stochasticity, complicating their prediction. Since Eq. 1 supervises all outputs uniformly, it cannot effectively discriminate the nuances of different regions as Fig. 4(b) shows. As a result, the model cannot efficiently learn to predict realistic dynamics in crucial regions.
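The definition of the weight $\boldsymbol{w}$ used in the next paragraph is missing at this point. One form consistent with the later description (a weight that grows where the frame-to-frame motion of the prediction departs from that of the ground truth) is sketched below. The exact expression is an assumption, as is taking the squared norm per latent location so that $w_{i}$ remains a spatial map:

$$w_{i}=\big\|\big(D_{\theta}(\hat{n}_{i};\sigma)-D_{\theta}(\hat{n}_{i-1};\sigma)\big)-\big(z_{i}-z_{i-1}\big)\big\|^{2},\qquad i=2,\dots,K.$$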

For numerical stability, we normalize $\boldsymbol{w}$ within each video clip. As shown in Fig. 4(c), the weight amplifies the presence of large motion disparities, highlighting dynamic regions while excluding monotonous backgrounds. Given the causality of future prediction, i.e. subsequent frames ought to follow previous ones, we define a new loss by penalizing the latter frame of each adjacent frame pair:
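The equation announced here (Eq. 3) is not present in this text. Reconstructed from the surrounding description (a re-weighting of the standard diffusion loss by $\boldsymbol{w}$, applied to the latter frame of each adjacent pair, with the gradient stopped on the weight), it would take roughly this form; the expectation subscripts and the retained factor $(1-m_{i})$ are assumptions:

$$\mathcal{L}_{\text{dynamics}}=\mathbb{E}_{\boldsymbol{z},\sigma,\hat{\boldsymbol{n}}}\Big[\sum_{i=2}^{K}\texttt{sg}(w_{i})\odot(1-m_{i})\odot\big\|D_{\theta}(\hat{n}_{i};\sigma)-z_{i}\big\|^{2}\Big],$$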

where $\texttt{sg}(\cdot)$ stops the gradient. By adaptively re-weighting the standard diffusion loss, $\mathcal{L}_{\text{dynamics}}$ can boost the learning efficiency of dynamic regions, e.g., the moving vehicles and sidewalks in Fig. 4(d).
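As an illustration of the re-weighting idea, the sketch below computes a motion-disparity weight for each adjacent frame pair and applies it to the squared error of the latter frame. Several pieces are assumptions made for the example: the element-wise form of the weight, the normalization by the per-clip maximum, and the toy list-based latents. Plain Python has no autograd, so the stop-gradient is only indicated in a comment.

```python
def motion_disparity_weights(pred, target):
    """w_i for each frame i >= 1: squared gap between predicted and true frame-to-frame change."""
    weights = []
    for i in range(1, len(pred)):
        weights.append([
            ((p1 - p0) - (t1 - t0)) ** 2
            for p1, p0, t1, t0 in zip(pred[i], pred[i - 1], target[i], target[i - 1])
        ])
    return weights


def normalize_per_clip(weights, eps=1e-8):
    """Rescale all weights of one clip by their maximum (one possible normalization)."""
    peak = max(max(w) for w in weights)
    return [[v / (peak + eps) for v in w] for w in weights]


def dynamics_loss(mask, pred, target):
    """Weighted squared error on the latter frame of each adjacent frame pair."""
    # In an autograd framework the weights would be detached here (the role of sg).
    weights = normalize_per_clip(motion_disparity_weights(pred, target))
    total = 0.0
    for i in range(1, len(pred)):
        if mask[i] == 1:          # condition frames carry no loss
            continue
        total += sum(
            w * (p - t) ** 2
            for w, p, t in zip(weights[i - 1], pred[i], target[i])
        )
    return total


# Two latent elements per frame: the first is static, the second moves steadily.
target = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
pred = [[0.0, 0.0], [0.1, 0.2], [0.1, 0.4]]   # motion of the second element is underestimated
mask = [1, 0, 0]
w = normalize_per_clip(motion_disparity_weights(pred, target))
loss = dynamics_loss(mask, pred, target)
```

In this toy clip the moving element receives a far larger weight than the static one, mirroring how the loss is meant to emphasize dynamic regions over monotonous backgrounds.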

Structure Preservation Loss. The trade-off between perceptual quality and motion intensity has been widely acknowledged in video generation, and our case is no exception. When it comes to high-resolution prediction for dynamic driving scenarios, we discover that the predicted structural details degrade severely with over-smoothed or broken objects, e.g., the outlines of vehicles unravel quickly as they move (see Fig. 12). To alleviate this problem, it is important to place more emphasis on structural details. Based on the fact that structural details, such as edges and textures, mainly reside in high-frequency components, we identify them in the frequency domain as follows:
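The corresponding display equation (cited later as Eq. 4) is absent here. From the symbols defined immediately afterwards, it can be reconstructed as below; the names $z^{\prime}_{i}$ and $\mathcal{F}$ for the filtered latent and the filtering operator are assumptions:

$$z^{\prime}_{i}=\mathcal{F}(z_{i})=\texttt{IFFT}\big(\mathcal{H}\odot\texttt{FFT}(z_{i})\big),$$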

where FFT and IFFT are the 2D discrete Fourier transform and inverse discrete Fourier transform respectively, and $\mathcal{H}$ is an ideal 2D high-pass filter that truncates low-frequency components under a certain threshold. The Fourier transforms are applied on each channel of $z_{i}$ independently. As illustrated in Fig. 4(e), features associated with structural information can be effectively emphasized by Eq. 4. The corresponding features from the predicted latent $D_{\theta}(\hat{n}_{i};\sigma)$ can also be extracted similarly. With the extracted high-frequency features, we devise a new structure preservation loss as:
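The loss itself (Eq. 5) is not reproduced in this text. A form consistent with the description that follows, comparing the high-pass-filtered prediction with the high-pass-filtered ground truth on prediction frames, would be the following; the name $\mathcal{L}_{\text{structure}}$, the expectation subscripts, and the factor $(1-m_{i})$ are assumptions:

$$\mathcal{L}_{\text{structure}}=\mathbb{E}_{\boldsymbol{z},\sigma,\hat{\boldsymbol{n}}}\Big[\sum_{i=1}^{K}(1-m_{i})\,\big\|\texttt{IFFT}\big(\mathcal{H}\odot\texttt{FFT}(D_{\theta}(\hat{n}_{i};\sigma))\big)-\texttt{IFFT}\big(\mathcal{H}\odot\texttt{FFT}(z_{i})\big)\big\|^{2}\Big].$$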

This loss function minimizes the disparity of high-frequency features between prediction and ground truth, so that more structural information can be retained. Our final training objective is a weighted sum of Eq. 1, Eq. 3 and Eq. 5, where $\lambda_{1}$ and $\lambda_{2}$ are trade-off weights to balance the optimization:
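The final objective (Eq. 6) is missing from this text. Given that it is described as a weighted sum of Eq. 1, Eq. 3 and Eq. 5 with trade-off weights $\lambda_{1}$ and $\lambda_{2}$, it can be written as below, where $\mathcal{L}_{\text{diffusion}}$ and $\mathcal{L}_{\text{structure}}$ denote the losses of Eq. 1 and Eq. 5. These two names, the name $\mathcal{L}_{\text{final}}$, and the assignment of $\lambda_{1}$ to the dynamics term and $\lambda_{2}$ to the structure term are assumptions:

$$\mathcal{L}_{\text{final}}=\mathcal{L}_{\text{diffusion}}+\lambda_{1}\mathcal{L}_{\text{dynamics}}+\lambda_{2}\mathcal{L}_{\text{structure}}.$$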

3.2 Phase Two: Learning Versatile Action Controllability

Unified Conditioning of Versatile Actions. To maximize usage flexibility, a driving world model should be able to leverage multiple action formats with different characteristics. For instance, one may use the world model to evaluate high-level policies, or to execute low-level maneuvers. However, existing approaches only support limited action controls, inhibiting their flexibility and applicability. Therefore, we incorporate a versatile set of action modes for Vista: (1) Angle and Speed stand for the most fine-grained action controls. We normalize angles to a fixed range and represent speeds in km/h. The signals from different timestamps are concatenated sequentially. (2) Trajectory is a series of 2D displacements in ego coordinates. It is widely used as the output of planning algorithms. We represent the trajectory in meters and flatten it into a sequence. (3) Command is the most high-level intention. Without loss of generality, we define four commands, i.e. go forward, turn right, turn left, and stop, which are implemented as categorical indices. (4) Goal Point is a 2D coordinate projected from the short-term ego destination onto the initial frame, serving as an interactive interface. The coordinate is normalized by the image size.

Note that these actions are heterogeneous and cannot be used interchangeably. After transforming all these actions into numerical sequences, we encode them as a unified concatenation of Fourier embeddings (see Fig. 3). These embeddings can be jointly ingested by learning additional projections to expand the input dimension of the cross-attention layers in the UNet. The new projections are initialized as zeros to enable gradual learning from the pretrained state. We empirically discover that incorporating action conditions through cross-attention layers yields faster convergence and stronger controllability compared to other approaches such as additive embeddings.
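To make the encoding step concrete, here is a small sketch of embedding a flattened action sequence with sinusoidal (Fourier) features and passing it through a zero-initialized projection. The frequency schedule, the number of frequencies, and the helper names are assumptions for illustration; the text does not specify them, and the real projections sit inside the cross-attention layers rather than standing alone.

```python
import math


def fourier_embed(value, num_freqs=4):
    """Sin/cos features of one scalar at geometrically spaced frequencies (assumed schedule)."""
    feats = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        feats.extend([math.sin(freq * value), math.cos(freq * value)])
    return feats


def embed_action_sequence(values, num_freqs=4):
    """Concatenate the embeddings of every number in a flattened action sequence."""
    out = []
    for v in values:
        out.extend(fourier_embed(v, num_freqs))
    return out


def zero_projection(in_dim, out_dim):
    """New projection weights, initialized to zero as described in the text."""
    return [[0.0] * in_dim for _ in range(out_dim)]


def project(weight, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# A toy trajectory of two waypoints (x, y) in meters, flattened into a sequence.
trajectory = [0.5, 0.0, 1.1, 0.1]
embedding = embed_action_sequence(trajectory)
projection = zero_projection(len(embedding), 8)
contribution = project(projection, embedding)
```

With all-zero weights the projected action contributes nothing at the first training step, so the network starts from its pretrained behavior and the action signal is learned gradually.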

Efficient Learning. We learn action controllability after the first training phase. Since the number of total iterations is crucial for diffusion training, we separate action control learning into two stages. In the first stage, we train our model at a low resolution (320 $\times$ 576), which achieves 3.5 $\times$ higher training throughput compared to the original resolution (576 $\times$ 1024). This stage constitutes the majority of training iterations. Then, we finetune the model at the desired resolution (576 $\times$ 1024) for a short duration, so that the learned controllability can effectively cater to high-resolution prediction.

However, tuning the UNet at a lower resolution directly may undermine the high-fidelity prediction ability. Conversely, freezing all UNet weights and training the new projections alone would precipitate a quality decline (see Appendix D), suggesting the necessity to make the UNet adaptable. To solve this, we freeze the pretrained UNet and introduce parameter-efficient LoRA adapters to each attention layer. After training, the low-rank matrices can be seamlessly integrated with the frozen weights, without introducing extra inference latency. Thus, the pretrained weights remain intact when training at the low resolution, avoiding deterioration of the pretrained high-fidelity prediction ability.
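The claim that low-rank adapters can be folded into the frozen weights without extra inference latency follows from linearity, which the sketch below checks on a toy layer. It uses the standard LoRA form $W' = W + BA$; the matrix sizes, the `scale` parameter, and the helper names are illustrative assumptions.

```python
def matvec(w, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def matmul(a, b):
    """Matrix-matrix product for list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_forward(w, lora_a, lora_b, x, scale=1.0):
    """Training-time path: frozen weight plus a separate low-rank branch."""
    base = matvec(w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]


def merge_lora(w, lora_a, lora_b, scale=1.0):
    """Inference-time path: fold the low-rank update into a single weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [
        [wij + scale * dij for wij, dij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight (out x in)
lora_a = [[0.5, -1.0]]         # rank-1 down-projection (r x in)
lora_b = [[2.0], [0.0]]        # rank-1 up-projection (out x r)
x = [1.0, 1.0]

y_adapter = lora_forward(w, lora_a, lora_b, x)
y_merged = matvec(merge_lora(w, lora_a, lora_b), x)
```

Both paths give the same output, so after merging the layer costs exactly one matrix product at inference, while `w` itself is never modified during training.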

Since different action formats cannot be converted into one another, it is impractical to apply multiple equivalent action conditions simultaneously at inference time. Additionally, it is burdensome to attain hybrid action controllability, which entails prohibitively expensive training to encompass all possible combinations. Thus, unlike common practices that activate all conditions during training, we enforce the independence of different action formats by enabling only one of them for each training sample. The remaining action conditions will be filled with zeros as unconditional inputs. As demonstrated in Appendix D, this simple constraint prevents the squandering of training cost on action combinations and maximizes the learning efficiency of each individual action mode within the same training steps.

Collaborative Training. Note that the aforementioned action conditions are not available in OpenDV-YouTube. On the other hand, nuScenes has adequate annotations to derive these conditions. To maintain generalization and learn controllability in tandem, we introduce a collaborative training strategy by utilizing the samples from both datasets, with the action conditions for OpenDV-YouTube set to zero. The action control learning phase adopts the same loss as Eq. 6. By learning from two complementary datasets, Vista gains versatile controllability that is generalizable to novel datasets.
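The two sampling rules above (only one action mode enabled per training sample, and all action conditions zeroed for samples without annotations) can be sketched as follows. The mode names, the embedding sizes in `DIMS`, and the function `build_condition` are hypothetical; the sketch zeroes already-embedded vectors, which is one plausible reading of "filled with zeros".

```python
import random

# Hypothetical embedding sizes for the four action modes.
DIMS = {"angle_speed": 4, "trajectory": 4, "command": 1, "goal_point": 2}


def build_condition(embedded, dims, rng):
    """Return per-mode condition vectors with at most one mode left non-zero.

    embedded: dict mapping mode name to its embedding, or None for an
    unlabeled sample (e.g. one drawn from a dataset without action annotations).
    """
    cond = {mode: [0.0] * dim for mode, dim in dims.items()}
    if not embedded:
        return cond, None                     # fully unconditional sample
    active = rng.choice(sorted(embedded))     # enable exactly one action mode
    cond[active] = list(embedded[active])
    return cond, active


rng = random.Random(0)
labeled = {
    "angle_speed": [0.1, 20.0, 0.2, 21.0],
    "trajectory": [0.5, 0.0, 1.0, 0.1],
    "command": [2.0],
    "goal_point": [0.4, 0.6],
}
cond_labeled, active_mode = build_condition(labeled, DIMS, rng)
cond_unlabeled, no_mode = build_condition(None, DIMS, rng)
```

Mixing both kinds of samples in one batch lets the unlabeled data keep shaping the predictive model while the labeled data teaches each action mode in isolation.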

3.3 Generalizable Reward Function

One application of world models is to evaluate actions by engaging a reward module. Drive-WM establishes a reward using external detectors. However, these detectors are developed on a particular dataset, which may become a bottleneck for reward estimation in arbitrary scenarios. On the other hand, Vista has ingested millions of human driving logs, exhibiting strong generalization across scenes. Based on the observation that out-of-distribution conditions will lead to increased diversity in generation, we utilize the prediction uncertainty from Vista itself as the source of our reward. Different from Drive-WM, our reward function seamlessly inherits the generalization of Vista without resorting to external models. Specifically, we estimate uncertainty via conditional variance. For reliable approximation, we denoise from randomly sampled noise with the same condition frame $\boldsymbol{c}$ and action $\boldsymbol{a}$ for $M$ rounds. Our reward function $R(\boldsymbol{c},\boldsymbol{a})$ is then defined as the exponential of averaged negative conditional variance:
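The reward equation is not present in this text. A form that matches the verbal definition (the exponential of the averaged negative conditional variance over $M$ denoising rounds) is given below. The superscript $(j)$ indexing the rounds, the mean $\mu^{\prime}$, and the unbiased $1/(M-1)$ normalization are assumptions:

$$\mu^{\prime}=\frac{1}{M}\sum_{j=1}^{M}D_{\theta}^{(j)}(\hat{\boldsymbol{n}};\boldsymbol{c},\boldsymbol{a}),\qquad R(\boldsymbol{c},\boldsymbol{a})=\exp\Big[\texttt{avg}\Big(-\frac{1}{M-1}\sum_{j=1}^{M}\big(D_{\theta}^{(j)}(\hat{\boldsymbol{n}};\boldsymbol{c},\boldsymbol{a})-\mu^{\prime}\big)^{2}\Big)\Big],$$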

where $\texttt{avg}(\cdot)$ averages all latent values within the video clip. Based on this formulation, unfavorable actions with larger uncertainties will lead to lower rewards. In contrast to commonly used evaluation protocols (e.g., the L2 error), our reward function can evaluate actions without referring to the ground truth actions. Note that we do not normalize the estimated rewards for the simplicity of definition, but it is straightforward to amplify the relative contrast by rescaling the estimated rewards with a factor.
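A minimal numerical sketch of this reward is shown below, assuming that each of the $M$ denoising rounds yields a latent flattened into a list of floats. The use of the sample variance (dividing by $M-1$) and the function name `reward_from_samples` are assumptions.

```python
import math
from statistics import fmean, variance


def reward_from_samples(samples):
    """exp(-avg(Var)) over M denoised latents produced under the same condition and action.

    samples: list of M flat latents of equal length (M >= 2).
    """
    if len(samples) < 2:
        raise ValueError("at least two rounds are needed to estimate a variance")
    # Variance across rounds for every latent element, then averaged over the clip.
    per_element_var = [variance(values) for values in zip(*samples)]
    return math.exp(-fmean(per_element_var))


# Three rounds that agree exactly: no uncertainty.
consistent = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
# Three rounds that disagree on the first element: higher uncertainty.
spread = [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]

r_consistent = reward_from_samples(consistent)
r_spread = reward_from_samples(spread)
```

Since the variance is non-negative, the reward always lies in $(0, 1]$, and candidate actions can be ranked by it without access to a ground-truth action.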

Experiments

In this section, we first demonstrate Vista’s strengths in generalization and fidelity in Sec. 4.1. We then show the impact of action controls in Sec. 4.2. We also substantiate the efficacy of the proposed reward function in Sec. 4.3. Finally, we conduct ablation studies on our key designs in Sec. 4.4. For more implementation details and experimental results, please refer to Appendix C and Appendix D.

Automatic Evaluation. Since none of the existing driving world models are publicly accessible, we compare against their reported quantitative results on nuScenes. Table 2 reports FID and FVD scores of all methods. Vista exceeds state-of-the-art driving world models by a non-trivial margin.

Human Evaluation. To analyze the generalization of Vista across different datasets, we compare it against three prominent general-purpose video generators trained on web-scale data (see Fig. 5). It is known that automatic metrics like FVD cannot conclusively reveal perceptual quality, let alone real-world dynamics. Therefore, we opt for human evaluation for more faithful analysis. Following recent advances, we adopt the Two-Alternative Forced Choice protocol. Specifically, participants are presented with a side-by-side video pair and asked to choose the video they deem better on two orthogonal aspects: visual quality and motion rationality. To avoid potential bias, we crop each video to a fixed aspect ratio, downsample all videos to the same resolution, and trim the excess frames when Vista generates longer videos than others. We only feed one condition frame to align with other models. To ensure the variety of scenes, we uniformly assemble 60 scenes from four representative datasets: OpenDV-YouTube-val, nuScenes, Waymo, and CODA. These datasets collectively exemplify the intricacy and diversity of real-world driving, e.g., OpenDV-YouTube-val includes geofenced districts, Waymo offers a unique domain compared to our training data, and CODA contains extremely challenging corner cases. We collect a total of 2640 answers from 33 participants. As presented in Fig. 8, Vista outperforms all baselines on both aspects, demonstrating its profound comprehension of the driving dynamics. Further, unlike other models that are only applicable for short-term generation, Vista can accommodate more dynamic priors and produce coherent long-horizon rollouts as shown in Fig. 6.

4.2 Results of Action Controllability

Quantitative Results. To evaluate the impact of action controls, we divide the validation set of both nuScenes and the unseen Waymo dataset into four subsets according to our command categories. We then generate predictions using different modalities of the ground truth actions. The FVD score is measured on each subset and then averaged. A lower FVD score reflects a closer distribution to the ground truth videos, indicating that the predictions exhibit more resemblance to each specific type of behavior. Fig. 8 shows that our action controls can emulate the corresponding movements effectively.

Qualitative Results. Fig. 9 exhibits the versatile action controllability of our model. Vista can be effectively controlled by multi-modal actions, even in unseen scenarios beyond the training domain. In Appendix E, we also showcase the counterfactual reasoning ability of Vista using abnormal actions.

4.3 Results of Reward Modeling

To validate the efficacy of our reward function, we jitter the ground truth trajectories into a series of inferior trajectories. Specifically, we compute the standard deviation of each waypoint from the nuScenes training set as prior distributions. These priors are jointly rescaled to sample perturbations with different L2 errors. The perturbations are then added as offsets to the ground truth trajectories. To ensure the plausibility of sampled trajectories, we adopt an explicit correlating strategy to regularize offset sampling and recursively sample new trajectories until their offsets are consistent in tendencies. To demonstrate the generality of our reward function, we conduct reward estimation on Waymo, which is unseen in training. This is done by uniformly sampling from each command category on the Waymo validation set, resulting in 1500 cases in total. We compare the average reward of the trajectories with varying L2 errors in Fig. 11. Our reward decreases when the deviation from the ground truth increases, underscoring the potential of our approach to serve as a viable reward function. It also holds the promise to remedy the irrationality in current evaluation protocols for planning, such as the L2 error shown in Fig. 11. More in-depth analysis of rewards, including sensitivity to hyperparameters and rewards of other actions, is provided in Appendix D.
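The perturbation procedure can be illustrated with a rough sketch. It assumes Gaussian offsets drawn from rescaled per-waypoint standard deviations and interprets "consistent in tendencies" as all offsets sharing a sign along each axis, enforced by resampling. Both choices, the toy numbers, and the helper names are assumptions, since the text does not spell out the correlating strategy.

```python
import math
import random


def same_sign(values):
    """True if every value is non-negative or every value is non-positive."""
    return all(v >= 0 for v in values) or all(v <= 0 for v in values)


def sample_offsets(stds, factor, rng, max_tries=100000):
    """Draw per-waypoint (dx, dy) offsets from rescaled priors until they agree in direction."""
    for _ in range(max_tries):
        offsets = [
            (rng.gauss(0.0, factor * sx), rng.gauss(0.0, factor * sy))
            for sx, sy in stds
        ]
        if same_sign([dx for dx, _ in offsets]) and same_sign([dy for _, dy in offsets]):
            return offsets
    raise RuntimeError("no consistent sample found")


def perturb(trajectory, offsets):
    """Add the sampled offsets to a ground truth trajectory."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(trajectory, offsets)]


def mean_l2(a, b):
    """Average Euclidean distance between corresponding waypoints."""
    return sum(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b)) / len(a)


gt = [(0.0, 1.0), (0.1, 2.0), (0.3, 3.0), (0.6, 4.0)]      # toy ground truth waypoints (m)
stds = [(0.1, 0.2), (0.2, 0.4), (0.3, 0.6), (0.4, 0.8)]    # hypothetical per-waypoint priors

mild_offsets = sample_offsets(stds, 1.0, random.Random(0))
severe_offsets = sample_offsets(stds, 3.0, random.Random(0))
mild = perturb(gt, mild_offsets)
severe = perturb(gt, severe_offsets)
```

Rescaling the priors by a larger factor yields proportionally larger L2 errors, which gives a family of increasingly poor trajectories whose rewards can then be compared.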

4.4 Ablation Study

Dynamic Priors. We visualize the outcomes of applying different orders of dynamic priors in Fig. 11. It shows that dynamic priors play a pivotal role in long-horizon rollouts, where maintaining coherence with respect to historical frames is essential.

Auxiliary Supervisions. To verify the effectiveness of the proposed losses, we devise two additional variants by individually ablating each loss from a variant that incorporates both losses. We compare their effects through Fig. 12, which confirms that the dynamics enhancement loss can promote the learning of real-world dynamics, and the structure preservation loss can reinforce structural details.

Conclusion

In this paper, we introduce Vista, a generalizable driving world model with enhanced fidelity and controllability. Based on our systematic investigations, Vista is able to predict realistic and continuous futures at high spatiotemporal resolution. It also possesses versatile action controllability that is generalizable to unseen scenarios. Moreover, it can be formulated as a reward function to evaluate actions. We hope Vista will usher in broader interest in developing generalizable autonomy systems.

Limitations and future work. As an early endeavor, Vista still exhibits some limitations with respect to computation efficiency, quality maintenance, and training scale. Our future work will look into applying our method to scalable architectures. More discussions are included in Appendix A.

Acknowledgments

This work is supported by the National Key R&D Program of China (2022ZD0160104), the National Natural Science Foundation of China (62206172), and the Shanghai Committee of Science and Technology (23YF1462000). This work is also partially supported by the BMBF (Tübingen AI Center, FKZ: 01IS18039A), the DFG (SFB 1233, TP 17, project number: 276693517), and the EXC (number 2064/1 – project number: 390727645). We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Kashyap Chitta. We also appreciate Zetong Yang, Chonghao Sima, Linyan Huang, and the other members of OpenDriveLab for valuable feedback. We express our sincere gratitude to all anonymous participants for helping with the human evaluation.

Appendix A Discussions

To facilitate a thorough understanding of this work, we discuss intuitive questions that might be raised.

Q1. Why is at least position, velocity, and acceleration required to predict coherent futures?

Position ensures the predicted future begins continuously with the current state. Velocity manifests how objects are moving, e.g., whether they are turning left or turning right. Acceleration represents how velocity changes over time, e.g., whether the surroundings are moving faster or moving slower. Without utilizing acceleration as a cue, a car overtaking the ego-vehicle may suddenly be passed by in the next autoregressive prediction step. These three priors provide essential cues to allow consistent future extension with respect to historical observation.

Q2. How is the form of the proposed reward function defined?

Unlike VIPER and Diffusion Reward, which both make discrete predictions, our model predicts continuous latents. Therefore, our reward is estimated from conditional variance rather than log-likelihood or entropy. In addition, measuring uncertainty with log-likelihood requires comparing the prediction to the ground truth. As we intend to deploy the reward in arbitrary scenarios, the approach of VIPER is infeasible for our objective. Note that our reward calculation is meticulously designed to satisfy the Kolmogorov axioms, i.e. it is non-negative and the measure of the entire sample space is $1$.

Q3. Reward estimation efficiency compared to the detector-based method.

Though our reward estimation involves multi-round denoising, it will not spend more compute than the detector-based reward function defined in Drive-WM. To be specific, Drive-WM obtains the rewards from the perception results. Given that the detectors take image sequences as inputs, Drive-WM has to accomplish all steps of the denoising process before perception. Differently, our reward function estimates the reward with the uncertainty that originates from the world model itself without relying on other perception models. Therefore, the estimation of uncertainty does not require completing the generation process. It can be realized by only denoising each sample for a few steps. In fact, as specified in Appendix C, the total computation required for reward estimation per situation (10 steps, 5 rounds) is no greater than that of generating the entire video (50 steps for our model) as Drive-WM does. Note that the computational cost for our reward estimation can be flexibly reduced to further improve its efficiency. As shown in Fig. 13, using 5 denoising steps ($50\%$ of the default computation) also yields satisfactory estimations of the reward.
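The comparison can be checked with a line of arithmetic, under the simplifying assumption that every denoising step costs one denoiser evaluation on a clip of the same size:

$$\underbrace{10\ \text{steps}\times 5\ \text{rounds}}_{\text{default reward estimation}}=50\ \text{evaluations}\;\le\;\underbrace{50\ \text{steps}}_{\text{full generation}},\qquad \underbrace{5\ \text{steps}\times 5\ \text{rounds}}_{\text{reduced setting}}=25\ \text{evaluations}=50\%\times 50.$$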

Q4. Usage of the proposed reward function.

(1) As mentioned in Sec. 4.3, the proposed reward function can potentially serve as an alternative metric of driving actions that mitigates the concerns in existing open-loop evaluation. (2) As demonstrated in Fig. 11, better actions generally yield higher rewards with our reward function. Taking advantage of this property, there is great promise for our reward function to be used as a critic module, which enables model-predictive control by executing the optimal action that maximizes the estimated reward. This procedure can be performed in conjunction with distribution-based planners that can make action proposals to reduce the search space.

Q5. Any other potential applications of Vista?

(1) As a generalizable predictive model, Vista could be utilized as a forward dynamics model to simulate short-term dynamics and assist long-horizon planning tasks like visual navigation. (2) It is also intriguing to utilize Vista as an implicit driving policy, which is spontaneously acquired through future prediction. After synthesizing the video plan, we can convert the resultant image trajectory to executable actions by a non-causal inverse dynamics model, which can be efficiently learned from suboptimal data in contrast to the imitation learning pipeline. In autonomous driving, the inverse dynamics model could be implemented with visual odometry. (3) In collaboration with the reward function, it is also worth investigating if Vista could facilitate model-based reinforcement learning by boosting the sampling efficiency in real-world scenarios.

Vista and GenAD differ fundamentally in control versatility and prediction fidelity. First of all, Vista is a generalizable world model that can be controlled by multi-modal action conditions. Although GenAD has also trained a trajectory-conditioned extension, its weights are fully fine-tuned on nuScenes and the generalization of its action control has never been validated. In contrast, Vista integrates versatile action controllability that can generalize to new scenarios in a zero-shot manner. Unlike GenAD, which requires labeling OpenDV-YouTube with commands and texts, our collaborative training strategy avoids this labeling effort, which may introduce accumulated noise and conflicts. In addition, Vista (10 Hz, 576 $\times$ 1024) operates at a much higher frame rate and resolution, considerably beyond the capability of GenAD (2 Hz, 256 $\times$ 448) along both temporal and spatial axes. Different from GenAD, we also put forth several dedicated designs for high-fidelity prediction. We find that Vista, with a lower model complexity, achieves much better FID and FVD scores than GenAD (see Table 2).

Q7. Limitations, failure cases, and possible solutions.

As one of the pioneering efforts, Vista still has a few limitations that call for future work. (1) Since Vista predicts futures at an exceptional spatiotemporal resolution, it is inevitably computationally expensive, particularly in downstream applications. Potential solutions may include faster sampling methods and training-based distillation. (2) The prediction may undergo an apparent degradation during long-horizon rollouts or drastic view shifts. Extra refinements on the prediction results could be helpful. Speculatively, applying our recipe to more scalable architectures is also a promising way to address this limitation. (3) Similar to other controllable video generation methods, the chance of failure still persists in our action controls, especially for ambiguous intentions such as commands and goal points as Fig. 8 reveals. Incorporating more datasets with action annotations for collaborative training could be beneficial. Using compositional classifier-free guidance to amplify the individual impact of action conditions may also help (at a cost of increased inference compute). (4) Although our training data is based on the largest public driving dataset, it is nowhere near the entirety of Internet driving data, thus leaving huge untapped potential to further expand the capabilities of Vista.

Despite the encouraging improvements, our work is by no means perfect when it comes to real-world applications that involve dealing with highly complicated situations. As Vista is based on the diffusion framework, which introduces stochastic outcomes and non-negligible latencies, deploying it into autonomous vehicles directly could pose safety risks. While it is not a silver bullet yet, we expect that Vista will inspire the community to further exploit the capabilities and applications of driving world models. As a prototype for generalizable driving world models, we hope that Vista can stimulate the investigations in developing generalizable systems for autonomous driving and machine intelligence.

Appendix B Related Work

Intelligent agents should be able to make efficacious decisions even in unseen situations. This requires fundamental knowledge of the world that generalizes to rare cases. As an internal manifestation of such knowledge, a world model predicts plausible futures of the world given potential actions. In principle, it not only predicts how the environment will unfold over time, but also deduces the underlying physical dynamics and agentic behaviors. Such properties can be useful for representation learning, model-based reinforcement learning, and model-predictive control. Recent works also induce language-based world models from large language models, but these are restricted to the textual space and struggle with grounding in physics.

Although world models have been extensively applied and have driven significant advances in simulated games and indoor embodiment, such investigations for autonomous driving still lag behind. Different from other tasks, world modeling for autonomous driving poses unique challenges, which primarily arise from the large fields of view with highly dynamic motions. Some practices imagine the world in the bird's eye view (BEV) space. Recent practices model the world state as raw sensor observations such as point clouds and images. The latter category, namely visual world models, holds more promise for scaling up due to sensor flexibility and data accessibility. Nevertheless, existing methods are still restricted to a particular dataset or simulator, compromising their generalization ability to novel domains. Meanwhile, these efforts lack systematic designs for the driving domain and only model the world at relatively low frame rates and resolutions, which discards fine-grained details and impairs their ability to express real-world behaviors. Moreover, most of them are restricted to a specific control modality, which hinders their compatibility with prevailing planning algorithms and their extension to more applications like decision-making or user interaction. Besides, existing methods seldom explore zero-shot action controllability across different datasets. The inferior generalization, fidelity, and controllability collectively preclude existing driving world models from broadly facilitating the development of autonomous driving.

B.2 Video Generation

Video generation is an effective way to model the world and has undergone remarkable advancements over the years. Pioneering works have studied various kinds of generative models. Swayed by the success of diffusion models, a surge of diffusion-based video generation methods has emerged. Recent works shift their focus towards image-to-video generation for its finer content description and better scalability in training data. However, most of them are not strict predictive models that generate videos starting from the condition image. Moreover, existing methods struggle with the intricate dynamics in driving scenarios from the ego perspective, which limits their feasibility as driving world models.

While the majority of existing methods produce videos without explicit controllability, two recent works introduce camera motion control to video generation. However, camera motion is conceptually distinct from vehicle actions, and both of these works are text-to-video methods without any prediction ability. In contrast, our model is a predictive world model that produces realistic dynamics and allows versatile action controls for autonomous driving.

Appendix C Implementation Details

We adopt the framework of SVD as the architecture of Vista, which consists of 2.5B parameters in total, including 1.6B UNet parameters. For action conditioning, we encode the value of each action sequence into Fourier embeddings with 128 channels.
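The Fourier encoding of action values can be illustrated with a minimal sketch. Only the channel count of 128 comes from the text above; the geometric frequency ladder, the `max_period` constant, and the function names are assumptions borrowed from common sinusoidal embeddings, not a description of the actual implementation.

```python
import math

def fourier_embed(value, channels=128, max_period=10000.0):
    """Map one scalar action value to a `channels`-dimensional sinusoidal embedding."""
    half = channels // 2
    # Assumed frequency schedule: geometric ladder from 1 down to about 1/max_period.
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    cos_part = [math.cos(value * f) for f in freqs]
    sin_part = [math.sin(value * f) for f in freqs]
    return cos_part + sin_part

def embed_action_sequence(values):
    """Embed every value of an action sequence (e.g. per-frame speeds) independently."""
    return [fourier_embed(v) for v in values]
```

Each scalar in an action sequence (a waypoint coordinate, a steering angle, a speed) would be embedded this way before being projected into the attention layers.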

C.2 Dataset

We utilize a rigorously filtered set of OpenDV-YouTube for training, and incorporate the nuScenes training set during the action control learning phase. Concretely, we manually eliminate 15 hours of irrelevant content from OpenDV-YouTube, yielding approximately 1735 hours of unlabeled driving videos. Since nuScenes is heavily biased, we balance its samples based on command categories to foster the learning of rare actions. The video clips are sampled with 25 frames at 10 Hz. Although nuScenes is logged at 12 Hz, we find no negative impact from treating its clips as 10 Hz videos. The model inputs are composed by cropping and resizing these clips to the target resolution. To categorize actions into commands, we follow the established conventions in planning and define the command of the ego-vehicle as "turn right" or "turn left" when its displacement exceeds 2 meters in the orthogonal direction relative to its initial heading. To allow more precise categorization, we additionally introduce a "stop" command when the forward driving distance is less than 2 meters.
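The command categorization rule can be sketched as follows. The 2-meter thresholds come from the text; the coordinate convention (positive lateral displacement means left), the check order (turns before stop), and the fallback label "go straight" are assumptions made for illustration.

```python
import math

def classify_command(dx, dy, heading, threshold=2.0):
    """Classify an ego displacement (dx, dy) in world coordinates.

    `heading` is the initial yaw in radians. Assumed convention: after rotating
    into the initial ego frame, x points forward and y points to the left.
    """
    forward = dx * math.cos(heading) + dy * math.sin(heading)
    lateral = -dx * math.sin(heading) + dy * math.cos(heading)
    if lateral > threshold:
        return "turn left"
    if lateral < -threshold:
        return "turn right"
    if forward < threshold:
        return "stop"
    return "go straight"  # assumed label for the remaining case
```

For example, `classify_command(10.0, -3.0, 0.0)` yields "turn right" under this convention, because the vehicle ends up 3 meters to the right of its initial heading.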

C.3 Training

In the first training phase, we train all UNet parameters at 576 $\times$ 1024 resolution on 128 A100 GPUs for 20K iterations, which takes about 8 days in total. We accumulate the gradients of 2 steps, yielding an effective batch size of 256. Following SVD, our model is trained with the EDM framework. We use the AdamW optimizer with a learning rate of $1\times 10^{-5}$. The learning rate for spatial layers is moderated by a discount factor of $0.1$. The coefficients $\lambda_{1}$ and $\lambda_{2}$ in Eq. 6 are set to $1.0$ and $0.1$ respectively. Offset noise is also used with a strength of $0.02$ as it helps improve temporal smoothness. We randomly sample different orders of dynamic priors with increasing probabilities, i.e., $1/15$, $2/15$, $4/15$, and $8/15$ for 0, 1, 2, and 3 condition frames respectively. Noise augmentation is disabled to retain more details from the condition frames.
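The sampling of dynamic prior orders can be sketched as a weighted draw over the number of condition frames. The probabilities are those stated above; the names `PRIOR_PROBS` and `sample_num_condition_frames` are hypothetical and the real training loop may organize this differently.

```python
import random
from fractions import Fraction

# Probability of using 0, 1, 2, or 3 condition frames (values from the text).
PRIOR_PROBS = {
    0: Fraction(1, 15),
    1: Fraction(2, 15),
    2: Fraction(4, 15),
    3: Fraction(8, 15),
}

def sample_num_condition_frames(rng):
    """Draw the number of condition frames for one training sample."""
    counts = list(PRIOR_PROBS)
    weights = [float(PRIOR_PROBS[c]) for c in counts]
    return rng.choices(counts, weights=weights, k=1)[0]
```

The weights double at each step and sum to one, so the full three-frame prior is used in slightly more than half of the training samples.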

In the action control learning phase, we freeze the pretrained weights and add LoRA and projection layers to all attention blocks of the UNet. The rank of LoRA is set to 16. We then train the new weights at 320 $\times$ 576 resolution for 120K iterations using a batch size of 8 and a learning rate of $5\times 10^{-5}$. Once controllability is clearly observable, we continue to finetune the unfrozen weights at 576 $\times$ 1024 resolution for another 10K iterations. We drop out each activated action mode with a ratio of $15\%$ to allow classifier-free guidance. The sampling ratio of OpenDV-YouTube and nuScenes is $1:1$ in this training phase. The whole training process for action controllability takes around 10 days on 8 A100 GPUs, with roughly 8 days at the low resolution and 2 days at the high resolution.
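The action dropout used for classifier-free guidance can be sketched as below. The 15% ratio is from the text; representing a dropped condition as `None` and the dictionary layout of the action modes are assumptions, since an actual implementation may instead zero out the corresponding embeddings.

```python
import random

def drop_action_modes(actions, drop_ratio=0.15, rng=random):
    """Independently drop each activated action mode with probability `drop_ratio`.

    `actions` maps a mode name (e.g. "trajectory", "command") to its value,
    or to None when that mode is not activated for this sample.
    """
    dropped = {}
    for name, value in actions.items():
        if value is not None and rng.random() < drop_ratio:
            dropped[name] = None  # the model sees this sample without that condition
        else:
            dropped[name] = value
    return dropped
```

Training on a mixture of conditioned and unconditioned samples is what later allows the conditional and unconditional predictions to be combined at sampling time.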

C.4 Sampling

We generate new videos using the DDIM sampler for 50 steps. The sampling starts with $\sigma_{\text{max}}$ at $700.0$. Unlike SVD, which linearly increases the guidance scale, we employ a triangular classifier-free guidance scheme to permit genuine long-horizon rollouts. Specifically, for the $i$ -th frame in each $K$ frames to predict, we assign its guidance scale $s(i)$ following:

where $s_{\text{min}}$ and $s_{\text{max}}$ denote the minimum and maximum scales along the frame axis. In our experiments, we define $s_{\text{min}}$ as $1.0$ and $s_{\text{max}}$ as $2.5$ . As illustrated in Fig. 14, this technique effectively relieves the saturation drift problem while enhancing details. To improve perceptual continuity, we split the generated latent into clips with an overlap of 3 frames before sending them to the video-aware decoder . The overlapped frames are averaged pixel-wise after decoding.
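As an illustration of a triangular guidance schedule, the sketch below assigns per-frame scales that rise linearly from $s_{\text{min}}$ to $s_{\text{max}}$ and fall back again. The values 1.0 and 2.5 are from the text; the exact formula is not reproduced here, so the placement of the peak at the middle frame and the normalization by $K-1$ are assumptions.

```python
def triangular_guidance_scales(num_frames, s_min=1.0, s_max=2.5):
    """Return one guidance scale per frame, shaped like a triangle over the clip.

    Assumed shape: s_min at the first and last frame, s_max at the middle.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    scales = []
    for i in range(num_frames):
        t = i / (num_frames - 1)        # normalized position in [0, 1]
        tri = 1.0 - abs(2.0 * t - 1.0)  # 0 at both ends, 1 at the middle
        scales.append(s_min + (s_max - s_min) * tri)
    return scales
```

Under this assumed shape, the last frames of a clip, which condition the next rollout step, receive weak guidance, which is consistent with the stated goal of limiting saturation drift.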

C.5 Human Evaluation

Recall that we ask the participants to judge side-by-side video pairs in terms of visual quality and motion rationality. To guarantee credible responses, we provide detailed guidance for each aspect of the human evaluation. For visual quality, we let the participants focus on the consistency and harmony of the generated content. For motion rationality, we encourage the participants to pay more attention to the plausibility of how the ego-vehicle and other agents move, e.g., whether they follow the traffic rules and exhibit safe behaviors. For all public models we compared, we use their official configurations for inference and set the prompt as "realistic drive view" if required.

C.6 Reward Estimation

For each condition frame and action pair, we accumulate an ensemble with size $M=5$ to obtain a reliable uncertainty estimation. Each sample in the ensemble is inferred for 10 denoising steps as we find it is unnecessary to generate high-quality results for uncertainty estimation. The coefficient $\beta$ in the correlating strategy is set to 0.5.
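The ensemble-based uncertainty estimate can be sketched as follows. The ensemble size comes from the text, but the functional form used here (the exponential of the negative mean per-element variance) is an assumption for illustration, and the coefficient $\beta$ is not modeled.

```python
import math

def uncertainty_reward(ensemble):
    """Turn M predictions for one (condition, action) pair into a scalar reward.

    `ensemble` is a list of M flattened predictions of equal length.
    Assumed form: higher disagreement across the ensemble gives a lower reward.
    """
    m = len(ensemble)
    n = len(ensemble[0])
    mean = [sum(p[j] for p in ensemble) / m for j in range(n)]
    # Unbiased per-element variance across the ensemble.
    var = [sum((p[j] - mean[j]) ** 2 for p in ensemble) / (m - 1) for j in range(n)]
    return math.exp(-sum(var) / n)
```

With $M=5$ samples that agree exactly, this sketch returns a reward of 1, and the reward decays towards 0 as the samples diverge.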

C.7 Ablation Studies

For the ablation of loss functions, we train each variant on OpenDV-YouTube for 10K steps at a spatial resolution of 576 $\times$ 1024. All ablations, including the additional ablations in Appendix D, are initialized by loading the pretrained weights of SVD and conducted on 8 A100 GPUs.

Appendix D Additional Experiments

Sensitivity to Hyperparameters. To investigate how the number of denoising steps and the ensemble size influence the performance of the proposed reward function, we repeat the reward estimation procedure in Sec. 4.3 with different hyperparameter settings. We start with 5 denoising steps and an ensemble size of 5 for each situation. We then add two variants that each double the computational cost, one by increasing the denoising steps to 10 (our default setting in Appendix C) and the other by increasing the ensemble size to 10. Following Sec. 4.3, we plot the correlation of the estimated rewards with L2 errors for the three variants in Fig. 13. It shows that increasing the denoising steps greatly increases the relative contrast of rewards, indicating that the number of denoising steps is a more important factor than the ensemble size under the same computation budget for reward estimation.

Reward Estimation for Commands. To show that the proposed reward is also practical for other actions, we estimate the rewards of ground truth commands from Waymo and compare them with the rewards of random commands. The results in Table 4 suggest that our reward is also competent for command selection.

D.2 Additional Ablation Studies

Action Independence Constraint. To verify the effectiveness of our learning strategy for action control, we additionally conduct a comparison by removing the action independence constraint proposed in Sec. 3.2. We train two variants on nuScenes at a resolution of 320 $\times$ 576 pixels for 62K steps. The comparison results are shown in Table 4.

Triangular Guidance Scheme. We further compare the introduced classifier-free guidance scheme with the vanilla scheme and the linear scheme to verify its necessity. Fig. 14 shows that our triangular scaling attains the best trade-off between visual quality and saturation preservation.

LoRA Adaptation. To show the necessity of applying LoRA in Sec. 3.2, we train two variants at a resolution of 320 $\times$ 576 pixels for 30K iterations. With the pretrained UNet weights fixed, we let one variant train LoRA and action projection layers in the attention blocks, while the other adjusts new projection layers only. As shown in Fig. 15, adding LoRA is essential for action control learning.

Appendix E Additional Visualizations

We further demonstrate the strong generalization ability of Vista by deploying it to different scenarios in the wild. The results in Fig. 16 and Fig. 17 show that Vista can make high-fidelity predictions in a very diverse range of scenarios.

E.2 Long-Horizon Prediction

In addition to Fig. 6, we provide more qualitative visualizations of long-horizon prediction in Fig. 18. Vista can continuously predict long-term futures with consistent content and motion.

E.3 Action Controllability

We provide more prediction results with different action inputs in Fig. 19. The results on OpenDV-YouTube-val and Waymo show that the versatile controllability of Vista can be readily transferred to different domains in a zero-shot manner.

E.4 Counterfactual Reasoning Ability

Counterfactual reasoning ability is one of the emergent abilities of world models. As shown in Fig. 20, Vista can effectively predict the counterfactual consequences caused by abnormal actions.

E.5 Human Evaluation Cases

To demonstrate the diversity of the scenes selected for human evaluation (Sec. 4), we show all cases gathered from OpenDV-YouTube-val, nuScenes, Waymo, and CODA in Fig. 21.

Appendix F License of Assets

Our training and evaluation utilize the data from four publicly licensed datasets. Our implementation is based on the codebase of SVD, which uses the MIT license. The pretrained checkpoint of SVD is distributed under the stable video diffusion non-commercial community license.