Diffusion-based Generation, Optimization, and Planning in 3D Scenes

Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu

Introduction

The ability to generate, optimize, and plan in 3D scenes is a long-standing goal for multiple research domains across computer vision, graphics, and robotics. Various tasks have been devised to achieve these goals, fostering downstream applications in motion generation, motion planning, grasp generation, navigation, embodied perception and manipulation, and autonomous driving.

Despite rich applications and great successes, existing models designed for these tasks exhibit two fundamental limitations for real-world 3D scene understanding.

First, most prior work leverages the conditional Variational Autoencoder (cVAE) for conditional generation in 3D scenes. The cVAE model uses an encoder-decoder structure to learn the posterior distribution and relies on the learned latent variables for sampling. Although cVAE is easy to train and sample from due to its simple architecture and one-step sampling procedure, it suffers from the posterior collapse problem: the learned latent variable is ignored by a strong decoder, leading to limited generation diversity from these collapsed modes. Such collapse is further magnified in 3D tasks with stronger 3D decoders and more complex and noisy input conditions, e.g., natural 3D scans.

Second, despite the close relations among generation, optimization, and planning in 3D scenes, a unified framework that could address the existing discrepancies among these models is still lacking. Previous work applies off-the-shelf physics-based post-optimization methods to the outputs of generative models and often produces inconsistent and implausible generations, especially when transferring to novel scenes. Similarly, planners are usually either standalone modules that operate on the results of generative models for trajectory planning or are learned separately with reinforcement learning (RL), leading to gaps between planning and other modules (e.g., generation) during inference, especially in novel scenes where exploration is limited.

To tackle these limitations, we introduce SceneDiffuser, a conditional generative model based on the diffusion process. SceneDiffuser eliminates the discrepancies and provides a single home for scene-conditioned generation, optimization, and planning. Specifically, with a denoising process, it learns a diffusion model for scene-conditioned generation during training. During inference, SceneDiffuser jointly solves scene-aware generation, physics-based optimization, and goal-oriented planning through a unified iterative guided-sampling framework. Such a design gives SceneDiffuser the following advantages:

Generation

Building upon the diffusion model, SceneDiffuser significantly alleviates the posterior collapse problem of scene-conditioned generative models. Since the forward diffusion process can be treated as data augmentation in 3D scenes, it helps traverse sufficient scene-conditioned distribution modes.

Optimization

SceneDiffuser integrates the physics-based objective into each step of the sampling process as conditional guidance, enabling the differentiable physics-based optimization during both the learning and sampling process. This design facilitates the physically-plausible generation, which is critical for tasks in 3D scenes.

Planning

Based on the scene-conditioned trajectory-level generator, SceneDiffuser possesses a global trajectory planner with physics and goal awareness, making the learned planner generalize better to long-horizon trajectories and novel 3D scenes.

As illustrated in Fig. 1, we evaluate SceneDiffuser on diverse 3D scene understanding tasks. The results on human pose, motion, and dexterous grasp generation significantly improve, demonstrating plausible and diverse generations with 3D scene and object conditions. The results on path planning for 3D navigation and motion planning for robot arms reveal the generalizable and long-horizon planning capability of SceneDiffuser.

Related Work

Conditional Generation in 3D Scenes

Generating diverse contents and rich interactions in 3D scenes is essential for understanding 3D scene affordances. Recently, we have witnessed several applications in conditional scene generation, human pose and motion generation in furnished 3D indoor scenes, and object-conditioned grasp pose generation. However, most previous methods rely on cVAE and suffer from the posterior collapse problem, especially when the 3D scene is natural and complex. In this work, SceneDiffuser addresses posterior collapse with the diffusion-based denoising process.

Physics-based Optimization in 3D Scenes

Producing physically plausible generations compatible with 3D scenes is one of the challenges in scene-conditioned generation. Previous work uses physics-based post-optimization or differentiable objectives to integrate collision and contact constraints into the generation framework. However, post-optimization approaches are often inefficient and cannot be learned jointly with the generative models, yielding inconsistent generation results. Similarly, differentiable approaches impose constraints only on the final objective and thus cannot optimize the physical interactions during sampling, producing implausible generations, especially when adapting to novel scenes. In this work, SceneDiffuser eliminates such inconsistency by integrating differentiable physics-based optimization into each step of the sampling process.

Planning in 3D Scenes

The ability to act and plan in 3D scenes is critical for an intelligent agent and has led to the recent surge of embodied AI research. Among all tasks, visual navigation has been the most studied in the vision and robotics communities. However, existing works rely heavily on model-based planning with a single-step dynamic model, lacking trajectory-level optimization for long-horizon planning. Further, physical interactions are not explicitly modeled in the planning. This deficiency makes it challenging to generalize to natural scenes, where exploration is limited and fast learning and adaptation are required. In comparison, with a global trajectory planner based on a trajectory-level generator, SceneDiffuser demonstrates better generalization to long-horizon plans and novel 3D scenes.

Diffusion-based Models

Diffusion models have emerged as a promising class of generative models for learning and sampling data distributions with an iterative denoising process, facilitating image, text, and shape generation. With flexible conditioning, they have been further extended to language-conditioned image, video, and 3D generation. Notably, Janner et al. integrate generation and planning into the same sampling framework for behavior synthesis. To the best of our knowledge, SceneDiffuser is the first framework that models 3D scene-conditioned generation with a diffusion model and integrates generation, optimization, and planning into a unified framework.

Background

Given a 3D scene $\mathcal{S}$, we aim to generate the optimal solution for completing the tasks (e.g., navigation, manipulation) given the goal $\mathcal{G}$ in the scene. We denote the state and action of an agent as $(\mathbf{s},\mathbf{a})$. The dynamic model defines the state transition as $p(\mathbf{s}_{i+1}|\mathbf{s}_{i},\mathbf{a}_{i})$, which is often deterministic in scene understanding (i.e., $\mathbf{s}_{i+1}=f(\mathbf{s}_{i},\mathbf{a}_{i})$). The trajectory is defined as $\boldsymbol{\tau}=(\mathbf{s}_{0},\mathbf{a}_{0},\cdots,\mathbf{s}_{i},\mathbf{a}_{i},\cdots,\mathbf{s}_{N})$, where $N$ denotes the horizon of task solving in discrete time.

3.2 Planning with Trajectory Optimization

The scene-conditional trajectory optimization is defined as maximizing the task objective:

The dynamic model is usually known for trajectory optimization. Considering the future actions and states with predictable dynamics, the entire trajectory $\boldsymbol{\tau}$ can be optimized jointly and non-progressively with traditional or data-driven planning algorithms. Trajectory-based optimization benefits from its global awareness of history and future states and can thus better model long-horizon tasks than single-step models in RL, where $\mathbf{a}_{0:N}^{*}=\operatorname*{arg\,max}_{\mathbf{a}_{0:N}}\sum_{i=0}^{N}r(\mathbf{s}_{i},\mathbf{a}_{i}|\mathcal{S},\mathcal{G})$.
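To make the joint, non-progressive view of trajectory optimization concrete, the following minimal sketch searches over whole action sequences under known deterministic dynamics and keeps the sequence with the largest summed reward. The 1D world, the reward, and the names `rollout` and `best_action_sequence` are illustrative assumptions, and exhaustive search merely stands in for the traditional or data-driven planners mentioned above.

```python
from itertools import product

def rollout(s0, actions, f):
    """Unroll deterministic dynamics s_{i+1} = f(s_i, a_i)."""
    states = [s0]
    for a in actions:
        states.append(f(states[-1], a))
    return states

def best_action_sequence(s0, horizon, action_set, f, reward):
    """Exhaustively search whole action sequences for the largest summed reward."""
    best, best_score = None, float("-inf")
    for actions in product(action_set, repeat=horizon):
        states = rollout(s0, actions, f)
        score = sum(reward(s, a) for s, a in zip(states, actions))
        if score > best_score:
            best, best_score = actions, score
    return best, best_score

# Hypothetical 1D world: reach position 3; every non-zero move costs 0.1.
GOAL = 3
step = lambda s, a: s + a
reward = lambda s, a: -abs((s + a) - GOAL) - 0.1 * abs(a)

plan, score = best_action_sequence(0, 4, (-1, 0, 1), step, reward)
print(plan, rollout(0, plan, step))
```

Because the whole sequence is scored at once, the search can trade an early cost for a later gain, which a greedy single-step rule cannot do.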

3.3 Diffusion Model

Diffusion models are a class of generative models that represent data generation as an iterative denoising process starting from Gaussian noise. A diffusion model consists of a forward and a reverse process. The forward process $q(\boldsymbol{\tau}^{t}|\boldsymbol{\tau}^{t-1})$ gradually destroys data $\boldsymbol{\tau}^{0}\sim q(\boldsymbol{\tau}^{0})$ into Gaussian noise. The parametrized reverse process $p_{\theta}(\boldsymbol{\tau}^{t-1}|\boldsymbol{\tau}^{t})$ recovers the data from noise, using a learned normal distribution at each of a fixed number of timesteps. The training objective for $\theta$ is denoising score matching over multiple noise scales. Please refer to Appendix A for detailed descriptions of the diffusion model and its variants.
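As an illustration of the forward process, the sketch below draws $\boldsymbol{\tau}^{t}$ directly from $\boldsymbol{\tau}^{0}$ using the standard closed form $\boldsymbol{\tau}^{t}=\sqrt{\hat{\alpha}^{t}}\,\boldsymbol{\tau}^{0}+\sqrt{1-\hat{\alpha}^{t}}\,\boldsymbol{\epsilon}$, with $\hat{\alpha}^{t}$ the cumulative product of $1-\beta^{s}$. The linear schedule and its endpoints are assumptions chosen for the example, not values taken from this work.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta^1..beta^T (an assumed, commonly used choice)."""
    if T == 1:
        return [beta_start]
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alphas(betas):
    """alpha_hat^t = product over s <= t of (1 - beta^s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(tau0, t, alpha_hats, rng):
    """Draw tau^t ~ q(tau^t | tau^0) in closed form; t is 1-indexed."""
    a = alpha_hats[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in tau0]
    tau_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(tau0, eps)]
    return tau_t, eps

rng = random.Random(0)
alpha_hats = cumulative_alphas(linear_betas(100))
tau0 = [0.0, 0.5, 1.0, 1.5]  # a flattened toy trajectory
tau_mid, _ = q_sample(tau0, 50, alpha_hats, rng)
tau_end, _ = q_sample(tau0, 100, alpha_hats, rng)
```

The signal weight $\sqrt{\hat{\alpha}^{t}}$ shrinks monotonically with $t$, so later timesteps retain less of the original trajectory.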

SceneDiffuser

SceneDiffuser models planning as trajectory optimization and solves the aforementioned problem with the spirit of planning as sampling, where the trajectory optimization is achieved by sampling trajectory-level distribution learned by the model. Leveraging the diffusion model with gradient-based sampling and flexible conditioning, SceneDiffuser models the scene-conditioned goal-oriented trajectory $p(\boldsymbol{\tau}^{0}|\mathcal{S},\mathcal{G})$ :

$p_{\theta}(\boldsymbol{\tau}^{0}|\mathcal{S})$ characterizes the probability of generating certain trajectories with the scene condition. It can be modeled using a conditional diffusion model with an iterative denoising process:
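The factorization and the denoising chain described in this passage can be sketched from its own definitions. Assuming Bayes' rule with the scene held fixed and the usual Markov factorization of a conditional diffusion model, one consistent form is given below; it is a reconstruction, and the original notation and equation numbering may differ.

```latex
\begin{aligned}
p(\boldsymbol{\tau}^{0}|\mathcal{S},\mathcal{G})
  &\propto p_{\theta}(\boldsymbol{\tau}^{0}|\mathcal{S})\,
           p_{\phi}(\mathcal{G}|\boldsymbol{\tau}^{0},\mathcal{S}), \\
p_{\theta}(\boldsymbol{\tau}^{0}|\mathcal{S})
  &= \int p(\boldsymbol{\tau}^{T})\prod_{t=1}^{T}
     p_{\theta}(\boldsymbol{\tau}^{t-1}|\boldsymbol{\tau}^{t},\mathcal{S})\,
     d\boldsymbol{\tau}^{1:T}.
\end{aligned}
```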

Optimization and Planning

$p_{\phi}(\mathcal{G}|\boldsymbol{\tau}^{0},\mathcal{S})$ represents the probability of reaching the goal with the sampled trajectory, where the goal can be flexibly defined by customized objective functions in various tasks. As shown in Eq. 4, the precise definition of this probability is $p_{\phi}(\mathcal{O}=1|\boldsymbol{\tau}^{0},\mathcal{S},\mathcal{G})$, where $\mathcal{O}$ is an optimality indicator that represents whether the goal is achieved. Intuitively, the trajectory objective in Eq. 1 can be a good indicator for such optimality. We therefore expand $p_{\phi}(\mathcal{G}|\boldsymbol{\tau}^{t},\mathcal{S})$ as its exponential in Eq. 5:

Here, $\varphi_{o}(\boldsymbol{\tau}^{t}|\mathcal{S})$ denotes the objective for optimizing the trajectory with scene condition and is independent of the task goal $\mathcal{G}$. In scene understanding, $\varphi_{o}$ usually denotes plausible physical relationships (e.g., collision, contact, and intersection). $\varphi_{p}(\boldsymbol{\tau}^{t}|\mathcal{S},\mathcal{G})$ indicates the objective for planning (i.e., goal-reaching) with scene condition. Both $\varphi_{o}$ and $\varphi_{p}$ can be explicitly defined or implicitly learned from observed trajectories with proper parametrization.
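Under these definitions, the exponential form of the goal likelihood can be sketched as follows. The unweighted sum of the two objectives is an assumption of this sketch; any relative weighting can be absorbed into $\varphi_{o}$ and $\varphi_{p}$ themselves.

```latex
p_{\phi}(\mathcal{G}|\boldsymbol{\tau}^{t},\mathcal{S})
  \propto \exp\!\big(\varphi_{o}(\boldsymbol{\tau}^{t}|\mathcal{S})
  + \varphi_{p}(\boldsymbol{\tau}^{t}|\mathcal{S},\mathcal{G})\big)
```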

4.1 Learning

$p_{\theta}(\boldsymbol{\tau}^{0}|\mathcal{S})$ is the scene-conditioned generator, which can be learned by the conditional diffusion model with the simplified objective of estimating the noise $\boldsymbol{\epsilon}$:

where $\hat{\alpha}^{t}$ is the pre-determined function in the forward process. With the learned $p_{\theta}(\boldsymbol{\tau}^{0}|\mathcal{S})$, we sample $p(\boldsymbol{\tau}^{0}|\mathcal{S},\mathcal{G})$ by taking advantage of the diffusion model’s flexible conditioning. Specifically, we approximate $p_{\phi}(\mathcal{G}|\boldsymbol{\tau}^{t},\mathcal{S})$ using the Taylor expansion around $\boldsymbol{\tau}^{t}=\boldsymbol{\mu}$ at timestep $t$ as

where $C$ is a constant, $\boldsymbol{\mu}=\boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{t},t,\mathcal{S})$ and $\boldsymbol{\Sigma}=\boldsymbol{\Sigma}_{\theta}(\boldsymbol{\tau}^{t},t,\mathcal{S})$ are the inferred parameters of original diffusion process, and

where $\lambda$ is the scaling factor for the guidance. With Eq. 9, we can sample $\boldsymbol{\tau}^{t}$ with the guidance of optimizing and planning objectives.
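A single guided denoising step can be sketched in a few lines. The sketch assumes the standard classifier-guidance form, in which the reverse-step mean is shifted along the objective gradient, $\boldsymbol{\tau}^{t-1}\sim\mathcal{N}(\boldsymbol{\mu}+\lambda\boldsymbol{\Sigma}\mathbf{g},\boldsymbol{\Sigma})$ with $\mathbf{g}=\nabla_{\boldsymbol{\tau}^{t}}(\varphi_{o}+\varphi_{p})|_{\boldsymbol{\tau}^{t}=\boldsymbol{\mu}}$. The diagonal covariance, the finite-difference gradient, and the toy goal objective are all simplifications introduced for illustration.

```python
import random

def numerical_grad(fn, x, h=1e-5):
    """Central-difference gradient of a scalar function fn at the list x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + h] + x[i + 1:]
        lo = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((fn(hi) - fn(lo)) / (2.0 * h))
    return grad

def guided_step(mu, sigma2, objective, lam, rng):
    """Sample from N(mu + lam * Sigma * g, Sigma) with a diagonal Sigma."""
    g = numerical_grad(objective, mu)
    shifted = [m + lam * s2 * gi for m, s2, gi in zip(mu, sigma2, g)]
    sample = [m + (s2 ** 0.5) * rng.gauss(0.0, 1.0) for m, s2 in zip(shifted, sigma2)]
    return shifted, sample

# Hypothetical planning objective: negative squared distance to a 2D goal.
goal = [1.0, 2.0]
phi_p = lambda tau: -sum((x - g) ** 2 for x, g in zip(tau, goal))

mu = [0.0, 0.0]      # stand-in for mu_theta(tau^t, t, S)
sigma2 = [0.1, 0.1]  # stand-in for the diagonal of Sigma_theta(tau^t, t, S)
shifted, sample = guided_step(mu, sigma2, phi_p, lam=1.0, rng=random.Random(0))
```

Larger values of `lam` pull the mean further toward high-objective regions, at the cost of moving away from what the base generator alone would propose.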

Of note, $\varphi_{p}$ and $\varphi_{o}$ serve as the pre-defined guidance for tilting the original trajectory with physical and goal constraints. However, they can also be learned from the observed trajectories. During training, we first fix the learned base model of $p_{\theta}(\boldsymbol{\tau}^{0}|\mathcal{S})$ , then learn $\phi_{o}$ and $\phi_{p}$ for optimization and planning with the following objective:

Alg. 1 summarizes the training procedure.
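The core quantity evaluated during training can be sketched as a Monte-Carlo estimate of the noise-prediction error. The code below assumes the standard simplified diffusion loss, $\mathbb{E}\,\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\boldsymbol{\tau}^{t},t,\mathcal{S})\|^{2}$; the toy schedule, the `oracle` predictor, and the placeholder scene are hypothetical, and no gradient update is performed.

```python
import math
import random

def simple_loss(eps_model, tau0, scene, alpha_hats, rng, n_samples=64):
    """Monte-Carlo estimate of E || eps - eps_model(tau^t, t, scene) ||^2."""
    total = 0.0
    for _ in range(n_samples):
        t = rng.randint(1, len(alpha_hats))  # uniform timestep, 1-indexed
        a = alpha_hats[t - 1]
        eps = [rng.gauss(0.0, 1.0) for _ in tau0]
        tau_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(tau0, eps)]
        pred = eps_model(tau_t, t, scene)
        total += sum((p - e) ** 2 for p, e in zip(pred, eps))
    return total / n_samples

alpha_hats = [0.9 ** t for t in range(1, 11)]  # toy schedule with T = 10
tau0 = [0.2, -0.4, 1.0]
scene = None  # placeholder for the scene condition

def oracle(tau_t, t, scene):
    """Cheating predictor that knows tau0 and recovers the injected noise."""
    a = alpha_hats[t - 1]
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for x, x0 in zip(tau_t, tau0)]

zero_model = lambda tau_t, t, scene: [0.0] * len(tau_t)

loss_oracle = simple_loss(oracle, tau0, scene, alpha_hats, random.Random(0))
loss_zero = simple_loss(zero_model, tau0, scene, alpha_hats, random.Random(0))
```

A trained network sits between these two extremes: the loss approaches zero as its noise estimate approaches the injected noise.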

4.2 Sampling

With different sampling strategies, SceneDiffuser can generate, optimize, and plan the trajectory in 3D scenes, under a unified framework of guided sampling. Alg. 2 summarizes the detailed sampling algorithm.

Sampling $\boldsymbol{\tau}^{0}$ from the distribution $p_{\theta}(\boldsymbol{\tau}^{0}|\mathcal{S})$ in Eq. 3 directly solves the conditional generation tasks. The sampled trajectories represent diverse modes and possible interactions with the 3D scenes.

Physics-based Optimization

The physical relations between each state and the environment are defined by $\varphi_{o}$ in Eq. 4 in a differentiable manner. For general optimization without the planning objective, the task goal $\mathcal{G}$ is to sample a plausible trajectory in 3D scenes. Therefore, we can draw physically plausible trajectories in 3D scenes by sampling from $p(\boldsymbol{\tau}^{0}|\mathcal{S},\mathcal{G})$ with Eq. 9.

Goal-oriented Planning

The goal-oriented planning can be formulated as motion inpainting under the sampling framework. Given the start state $\hat{\mathbf{s}}_{s}$ and the goal state $\hat{\mathbf{s}}_{g}$, the planning module returns a trajectory $\hat{\boldsymbol{\tau}}=(\hat{\mathbf{s}}_{0},\hat{\mathbf{a}}_{0},\cdots,\hat{\mathbf{s}}_{i},\hat{\mathbf{a}}_{i},\cdots,\hat{\mathbf{s}}_{g})$ that can reach the goal state. We set the first state as $\hat{\mathbf{s}}_{0}=\hat{\mathbf{s}}_{s}$ and define the goal state and the goal-reaching reward in $\varphi_{p}$. For each step $i$, we first keep the previous states and inpaint the remaining trajectory by sampling the goal-oriented SceneDiffuser with an iterative denoising process. Next, we take the action that can reach the next sampled state with $(\hat{\mathbf{a}}_{i-1},\hat{\mathbf{s}}_{i})$. As illustrated in Alg. 2, we repeat the planning steps until reaching the goal or the maximal planning step. Our planner leverages the trajectory-level generator and is thus more generalizable to long-horizon trajectories and novel scenes.
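The planning loop described above can be sketched as follows. The function `sample_inpainted` is a hypothetical stand-in for the goal-guided diffusion sampler: it keeps the already executed states fixed and fills the rest by noisy interpolation, whereas the actual model would denoise the remaining states. The 1D state, the additive dynamics $\mathbf{s}'=\mathbf{s}+\mathbf{a}$, and all parameter values are assumptions made for the example.

```python
import random

def sample_inpainted(prefix, goal, horizon, rng, noise=0.01):
    """Hypothetical stand-in for the goal-guided diffusion sampler.

    Keeps the known prefix fixed and fills the remaining states by noisy
    interpolation toward the goal; a learned model would denoise them instead.
    """
    last, remaining = prefix[-1], horizon - len(prefix)
    tail = []
    for k in range(1, remaining + 1):
        w = k / remaining
        jitter = rng.gauss(0.0, noise) if k < remaining else 0.0
        tail.append((1.0 - w) * last + w * goal + jitter)
    return list(prefix) + tail

def plan(start, goal, horizon, max_steps, rng, tol=1e-3):
    """Replan at every step: keep executed states, inpaint the rest, act once."""
    states, actions = [start], []
    while (abs(states[-1] - goal) > tol
           and len(actions) < max_steps
           and len(states) < horizon):
        traj = sample_inpainted(states, goal, horizon, rng)
        nxt = traj[len(states)]           # next sampled state
        actions.append(nxt - states[-1])  # action under s' = s + a
        states.append(nxt)
    return states, actions

states, actions = plan(start=0.0, goal=1.0, horizon=8, max_steps=20,
                       rng=random.Random(0))
```

Only the first newly sampled state is executed at each iteration, so later states are free to change when the trajectory is resampled.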

4.3 Model Architecture

The design of SceneDiffuser follows the practices of conditional diffusion models. Specifically, we augment the time-conditional diffusion model with cross-attention for flexible conditioning. As shown in Fig. 2, for each sampling step, the model computes the cross-attention between the 3D scene condition and the input trajectory, wherein the key and value are learned from the condition, and the query is learned from the input trajectory. The computed vector is fed into a feed-forward layer to estimate the noise $\boldsymbol{\epsilon}$. The 3D scene is processed by a scene encoder (i.e., Point Transformer or PointNet). Please refer to Appendix B for details.
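The conditioning mechanism can be illustrated with a minimal single-head cross-attention, in which queries come from trajectory tokens and keys and values come from scene tokens. The sketch uses standard scaled dot-product attention; the identity projection matrices and the tiny token sets are illustrative assumptions, not the model's actual weights or dimensions.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(traj_tokens, scene_tokens, Wq, Wk, Wv):
    """Queries from trajectory tokens; keys and values from scene tokens."""
    Q = [matvec(Wq, x) for x in traj_tokens]
    K = [matvec(Wk, c) for c in scene_tokens]
    V = [matvec(Wv, c) for c in scene_tokens]
    d = len(Q[0])
    out, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out, weights

I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for readability only
traj_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scene_tokens = [[2.0, 0.0], [0.0, 2.0]]
out, weights = cross_attention(traj_tokens, scene_tokens, I2, I2, I2)
```

Each output row has one entry per value dimension and mixes scene features according to how strongly the corresponding trajectory token attends to each scene token.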

4.4 Objective Design

For the optimization and planning objectives discussed in Sec. 4, we consider two types of trajectory objectives: (i) trajectory-level objectives, and (ii) the accumulation of step-wise objectives. For optimization, we consider step-wise collision and contact objectives, as well as a trajectory-level smoothness objective, i.e., $\{\varphi_{o}^{\text{collision}},\varphi_{o}^{\text{contact}},\varphi_{o}^{\text{smoothness}}\}$. For planning, we consider the accumulation of a simple step-wise distance, i.e., $\varphi_{p}^{L_{2}}$. Please refer to Appendix C for implementation details of our objective design. Empirically, we observe that parameterizing the objectives with timestep $t$ and increasing the guidance during the last several diffusion steps enhances the effect of guidance.

Experiments

To demonstrate SceneDiffuser is general and applicable to various scenarios, we evaluate SceneDiffuser on five scene understanding tasks. For generation, we evaluate the scene-conditioned human pose and motion generation and object-conditioned dexterous grasp generation. For planning, we evaluate the path planning for 3D navigation and motion planning for robot arms. We first introduce the compared methods used in our experiments, followed by detailed settings, results analyses, and ablative studies for each task. Due to the page limit, we refer to the Appendix for more details about the implementation, experimental settings, and additional results and ablations.

For conditional generation tasks, we primarily compare SceneDiffuser with the widely-adopted cVAE model and its variants. We also compare with two strategies for optimizing the physics of the trajectory in the cVAE: integrating the physics objective into training as a loss, and applying it afterwards as post-optimization. For planning, we compare with a stochastic planner learned by imitation learning using Behavior Cloning (BC) and a simple heuristic-based deterministic planner guided by the $L_{2}$ distance.

5.2 Human Pose Generation in 3D Scenes

Scene-conditioned human pose generation aims to generate semantically plausible and physically feasible single-frame human bodies within the given 3D scenes. We evaluate the task on the 12 indoor scenes provided by PROX and the refined version of PROX's per-frame SMPL-X parameters from LEMO. The input is the colored point cloud extracted by randomly downsampling the scene meshes provided in PROX. Training/testing splits are created following the literature, resulting in $\sim$53k frames in 8 scenes for training and the remaining 4 scenes for testing.

Metrics

We evaluate the physical plausibility of generated poses with both direct human evaluations and indirect collision and contact scores. For the direct measure, we randomly selected 1000 frames in the four test scenes and instructed seven participants to decide whether the generated human pose was plausible. We compute the mean percentage of plausible generations and term this metric the plausible rate. For indirect measures, we report (i) the non-collision score of the generated human bodies, calculated as the proportion of the scene vertices with positive SDF to the human body, and (ii) the contact score, obtained by checking whether the body is in contact with the scene within a pre-defined distance threshold. Following the literature, we evaluate the diversity of global translation, generated SMPL-X parameters, and the marker-based body-mesh representation. Specifically, we calculate the diversity of generated poses with the Average Pairwise Distance (APD) and standard deviation (std).
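The indirect and diversity measures can be sketched as below. The choice of Euclidean distance for APD, the population (rather than sample) standard deviation, and the toy inputs are assumptions of this sketch; the evaluation code used for the reported numbers may differ in such details.

```python
import math

def apd(samples):
    """Average pairwise Euclidean distance over all unordered sample pairs."""
    n = len(samples)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(samples[i], samples[j])
    return total / (n * (n - 1) / 2)

def per_dim_std(samples):
    """Population standard deviation of each coordinate across samples."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    return [math.sqrt(sum((s[d] - means[d]) ** 2 for s in samples) / n)
            for d in range(dims)]

def non_collision_score(scene_sdf_values):
    """Fraction of scene vertices whose signed distance to the body is positive."""
    return sum(1 for v in scene_sdf_values if v > 0) / len(scene_sdf_values)

translations = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # toy global translations
diversity = apd(translations)
spread = per_dim_std(translations)
score = non_collision_score([0.1, -0.2, 0.3, 0.05])
```

The same `apd` and `per_dim_std` helpers apply unchanged to SMPL-X parameter vectors or flattened marker positions.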

Results

Tab. 1 quantitatively demonstrates that SceneDiffuser generates significantly better poses while maintaining generation diversity. We further provide qualitative comparisons between baseline models and SceneDiffuser in Fig. 3. While achieving comparable diversity, collision, and contact scores, our model generates considerably fewer physically implausible poses (e.g., floating, severe collision). This is reflected by its significant advantage (i.e., over 30%) over cVAE-based baselines on the plausible rate. We observe this large improvement both quantitatively from the plausible rate and non-collision score and qualitatively in Fig. 3. Notably, our optimization-guided sampling improves the generator by 25% on the plausible rate, showing the efficacy of the proposed optimization-guided sampling strategy and its potential for a broader range of 3D tasks with physics-based constraints or objectives.

5.3 Human Motion Generation in 3D Scenes

We consider generating human motion sequences under two different settings: (1) condition solely on the 3D scene, and (2) condition on both the starting pose and the 3D scene. We use the same human and scene representation as in Sec. 5.2 and clip the original LEMO motion sequence into segments with a fixed duration (60 frames). In total, we obtain 28k motion segments with the distance between each start and end pose being longer than 0.2 meters. We follow the same split in Sec. 5.2 for training/testing and the same evaluation metrics for the pose generation. We report the average values of pose metrics over motion sequence as our performance measure.
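The segment extraction can be sketched as follows: a sequence of per-frame root positions is split into fixed-length windows, and a window is kept only if its start and end positions are more than 0.2 meters apart. The non-overlapping stride and the use of the root position as the pose location are assumptions, since the text does not specify them.

```python
import math

def clip_segments(positions, length=60, stride=60, min_travel=0.2):
    """Split a per-frame position sequence into fixed-length segments.

    A segment is kept only if its first and last positions are more than
    `min_travel` meters apart. The stride is an assumed parameter.
    """
    kept = []
    for start in range(0, len(positions) - length + 1, stride):
        seg = positions[start:start + length]
        if math.dist(seg[0], seg[-1]) > min_travel:
            kept.append((start, start + length))
    return kept

# Toy sequence: 60 frames walking along x, then 60 frames standing still.
walk = [(0.01 * i, 0.0, 0.0) for i in range(60)]
stand = [(0.59, 0.0, 0.0)] * 60
segments = clip_segments(walk + stand)
```

In this toy input only the walking window survives the filter, since the standing window has zero displacement.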

Results

As quantitatively shown in Tab. 2, SceneDiffuser consistently generates high-quality motion sequences compared to cVAE baselines. Specifically, our generated motion outperforms baseline models on both plausible rate and contact scores. This performance gain indicates better coverage of motion that involves rich interaction with the scene while remaining physically plausible. It also causes lower diversity in metrics (e.g., translation variance) since the plausible space for the motion is limited compared with cVAE. Empirically, we observe that providing the start position of motion as a condition constrains possible future motion sequences and leads to a drop in generation diversity for all models. In addition, providing the start condition benefits the physical plausibility since the motion starts from a plausible pose. We also note only a marginal performance improvement after applying optimization-guided sampling. One potential reason is that the generated motions are already plausible and receive small guidance from the optimization. As qualitatively shown in Fig. 4, SceneDiffuser generates diverse motions (e.g., “sit,” “walk”) from the same start position in unseen 3D scenes.

5.4 Dexterous Grasp Generation for 3D Objects

Metrics

We evaluate models in terms of success rate, diversity, and collision depth. We test whether a grasp is successful in IsaacGym by applying external forces to the object and measuring the movement of the object. To measure how well learned models capture the diversity of successful grasping poses in the training data, we report the success rate of generated poses that lie at different variance levels from the mean pose of the training data. To test models' performance on physically correct grasps, we measure the collision depth as the maximum depth by which the hand penetrates the object in each successful grasp. In all cases, we ignore the root transformation of the hand as it does not contribute to the diversity of grasping types.

Results

Tab. 3 quantitatively demonstrates that SceneDiffuser generates significantly better grasp poses in terms of success rate while correctly balancing the diversity of generation and grasp success. This result indicates that SceneDiffuser achieves a consistently high success rate without much performance drop when the generated pose diverges from the mean pose in the training data. We also show that, by applying the optimizer on top of SceneDiffuser, the guided sampling process can reduce physical violations in the generated grasping poses, outperforming the state-of-the-art baseline without additional training or intermediate representation (i.e., contact maps). We provide qualitative results in Fig. 5 for visualization.

5.5 Path Planning for 3D Scene Navigation

Metrics

We evaluate the planned results by checking if the “robot” can move from the start to the target without collision along the planned trajectory. We report the average success rate and planning steps over all test cases.

Results

As shown in Tab. 4, SceneDiffuser outperforms both the BC and the deterministic planner baseline. These results indicate the efficacy of guided sampling with the planning objective, especially given that all test scenes are unseen during training. Crucially, as simple heuristics (like $L_{2}$ ) oftentimes lead to dead-ends in path planning, SceneDiffuser can correctly combine past knowledge on the scene-conditioned trajectory distribution and planning objective under specific unseen scenes to redirect planning direction, which helps to avoid obstacles and dead-ends to reach the goal successfully. Compared with the baseline models, our model also requires fewer planning steps while maintaining a higher success rate. This suggests that SceneDiffuser successfully navigates to the target without diverging even in long-horizon tasks, where classic RL-based stochastic planners suffer (i.e., the low performance of BC).

5.6 Motion Planning for Robot Arms

Metrics

Similar to Sec. 5.5, we evaluate the generated trajectories by success rate on unseen scenes and the average number of planning steps. We consider a trajectory successful if the robot arm reaches the goal pose within a certain distance threshold and within a limited number of steps.

Results

We observe similar overall performance as in Sec. 5.5. Tab. 4 shows that SceneDiffuser consistently outperforms both the RL-based BC baseline and the deterministic planner baseline. SceneDiffuser’s planning steps for successful trials are also comparable with the deterministic planner, showing the efficacy of the planner in long-horizon scenarios.

5.7 Ablation Analyses

We explore how the scaling coefficient $\lambda$ influences the human pose generation results. We report the diversity and physics metrics of sampling results under different values of $\lambda$, ranging from 0.1 to 100. As shown in Tab. 5, $\lambda$ balances collision/contact quality against diversity in human pose generation. Specifically, $\lambda=1.0$ leads to the best physical plausibility, while larger $\lambda$ values lead to more diverse generation results. We attribute this effect to the optimization: with larger $\lambda$ values, the optimization draws poses away from the scene. Due to the page limit, we provide more ablative studies in Appendix E, including the sampling steps, choices and hyperparameters of objectives, and model architectures.

5.8 Limitation

The primary limitation of SceneDiffuser is its slow training and test speed compared to previous scene-conditioned generative models, a common issue of diffusion-based methods. We also observe that the optimization and planning are highly dependent on the objective designs, which requires effort in hyper-parameter tuning.

Conclusion

We propose SceneDiffuser as a general conditional generative model for generation, optimization, and planning in 3D scenes. SceneDiffuser is designed to be scene-aware, physics-based, and goal-oriented. We demonstrate that SceneDiffuser outperforms previous models by a large margin on various tasks, establishing its efficacy and flexibility.

A promising future direction is extending SceneDiffuser to richer 3D representations, including RGB-D images, semantic images, bird's-eye view (BEV) images, videos, 3D meshes, and neural radiance fields (NeRF). Such flexible conditions consume a tremendous amount of 3D training data, which is also a significant challenge. We also hope to extend SceneDiffuser to outdoor scenes, e.g., autonomous driving scenarios. Moreover, SceneDiffuser can be combined with recent large language models (LLMs) for automatic generation and planning with natural language instructions in 3D scenes, which is promising for the vision and robotics community. Finally, SceneDiffuser can serve as a tool for analyzing the behaviors of humans and agents if we can properly learn the planning objective, which naturally encodes the values and preferences that underlie the trajectories.

Acknowledgement

We thank Ruiqi Gao and Ying Nian Wu for their helpful discussions and suggestions. This work is supported in part by the National Key R&D Program of China (2021ZD0150200) and the Beijing Nova Program.

Appendix A Background for Diffusion Model

A diffusion model is defined by a forward process that gradually corrupts data $\boldsymbol{\tau}^{0}\sim q(\boldsymbol{\tau}^{0})$ over $T$ timesteps

and a reverse process $p_{\theta}(\boldsymbol{\tau}^{0})=\int p_{\theta}(\boldsymbol{\tau}^{0:T})d\boldsymbol{\tau}^{1:T}$ where

The forward process hyperparameters $\beta^{t}$ are set so that $\boldsymbol{\tau}^{T}$ is approximately distributed according to a standard normal distribution, so the prior over $\boldsymbol{\tau}^{T}$ is set to a standard normal distribution as well. The reverse process is trained to match the joint distribution of the forward process by optimizing the evidence lower bound (ELBO). As suggested by the literature, we can parametrize the reverse process as:

where $\alpha^{t}=1-\beta^{t}$, $\hat{\alpha}^{t}=\prod^{t}_{s=1}\alpha^{s}$, and $\hat{\beta}^{t}=\frac{1-\hat{\alpha}^{t-1}}{1-\hat{\alpha}^{t}}\beta^{t}$.

We can optimize a modified loss instead of the ELBO to improve the sample quality, depending on whether we learn $\boldsymbol{\Sigma}$ or treat it as a fixed hyper-parameter. For the non-learned case, we use the simplified loss:

It is a weighted form of the ELBO that resembles denoising score matching over multiple noise scales.

The goal of the conditional diffusion model is to learn a conditional distribution $p_{\theta}(\boldsymbol{\tau}^{0}|\mathbf{c})$. We modify the diffusion model to include the condition $\mathbf{c}$ as input to the reverse process:
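For reference, the standard denoising-diffusion expressions that this appendix builds on can be written in the present notation as below. They follow the usual formulation in the diffusion literature and are offered as a sketch; the exact forms and numbering of the original displayed equations are not reproduced here. The last line is the conditional reverse step with the condition $\mathbf{c}$ added as an input.

```latex
\begin{aligned}
q(\boldsymbol{\tau}^{1:T}|\boldsymbol{\tau}^{0})
  &= \prod_{t=1}^{T} q(\boldsymbol{\tau}^{t}|\boldsymbol{\tau}^{t-1}), \qquad
q(\boldsymbol{\tau}^{t}|\boldsymbol{\tau}^{t-1})
   = \mathcal{N}\big(\boldsymbol{\tau}^{t};\sqrt{1-\beta^{t}}\,\boldsymbol{\tau}^{t-1},\beta^{t}\mathbf{I}\big), \\
p_{\theta}(\boldsymbol{\tau}^{0:T})
  &= p(\boldsymbol{\tau}^{T})\prod_{t=1}^{T}p_{\theta}(\boldsymbol{\tau}^{t-1}|\boldsymbol{\tau}^{t}), \qquad
p_{\theta}(\boldsymbol{\tau}^{t-1}|\boldsymbol{\tau}^{t})
   = \mathcal{N}\big(\boldsymbol{\tau}^{t-1};\boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{t},t),\boldsymbol{\Sigma}_{\theta}(\boldsymbol{\tau}^{t},t)\big), \\
\boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{t},t)
  &= \frac{1}{\sqrt{\alpha^{t}}}\Big(\boldsymbol{\tau}^{t}
     -\frac{\beta^{t}}{\sqrt{1-\hat{\alpha}^{t}}}\,\boldsymbol{\epsilon}_{\theta}(\boldsymbol{\tau}^{t},t)\Big), \\
\mathcal{L}_{\text{simple}}
  &= \mathbb{E}_{t,\boldsymbol{\tau}^{0},\boldsymbol{\epsilon}}
     \big\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\boldsymbol{\tau}^{t},t)\big\|^{2}, \\
p_{\theta}(\boldsymbol{\tau}^{t-1}|\boldsymbol{\tau}^{t},\mathbf{c})
  &= \mathcal{N}\big(\boldsymbol{\tau}^{t-1};\boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{t},t,\mathbf{c}),\boldsymbol{\Sigma}_{\theta}(\boldsymbol{\tau}^{t},t,\mathbf{c})\big).
\end{aligned}
```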

Appendix B Model Architectures

For the tasks of human pose/motion generation in 3D scenes and path planning for 3D scene navigation, we use the same scene encoder, i.e., the PointTransformer adopted from the original architecture. We pre-train the scene encoder with indoor scene semantic segmentation task on ScanNet dataset and freeze it while training SceneDiffuser. The outputs of the scene encoder are used as the key and value of the cross-attention module.

For processing the trajectory, we employ an FC layer and positional embedding to obtain the high-dimensional feature of the trajectory. We then fuse the trajectory feature with denoising timestep embedding with a ResBlock. After that, we feed the fused feature vectors to a self-attention module and use them as the query of the cross-attention module. Finally, the computed vector is fed into a feedforward layer to estimate the noise $\boldsymbol{\epsilon}$ .

For the task of dexterous grasp generation for 3D objects, we use PointNet as the 3D object encoder. Before the cross-attention module, the outputs of PointNet are reshaped to $(N_{\text{token}},N_{\text{feat}})$ , where $N_{\text{token}}$ refers to the number of tokens and $N_{\text{feat}}$ refers to the dimensions of the feature.

For the task of motion planning for robot arms, we adopt PointTransformer as the scene encoder, which is jointly trained from scratch with SceneDiffuser.

Appendix C Objective Design

For human pose and motion generation in 3D scenes, we encourage contact and non-collision between the generated human body meshes and the scene meshes. Following prior work, we design the optimization objective $\varphi_{o}(\boldsymbol{\tau}^{t}|\mathcal{S})=\alpha_{1}\varphi_{o}^{\text{collision}}+\alpha_{2}\varphi_{o}^{\text{contact}}$ for pose generation and $\varphi_{o}(\boldsymbol{\tau}^{t}|\mathcal{S})=\alpha_{1}\varphi_{o}^{\text{collision}}+\alpha_{2}\varphi_{o}^{\text{contact}}+\alpha_{3}\varphi_{o}^{\text{smoothness}}$ for motion generation, where the $\alpha_{i}$ are balancing weights. $\varphi_{o}^{\text{collision}}$ minimizes the negative signed-distance values of the body mesh vertices given the negative signed distance field (SDF) of the 3D scene $\Phi_{s}^{-}(\cdot)$, which is formulated as

where $\mathcal{M}^{t}$ is the SMPL-X body mesh at denoising step $t$. $\varphi_{o}^{\text{contact}}$ minimizes the distance between the contact body parts of the generated body mesh and the scene mesh, which is formulated as

where $C(\cdot)$ is the operation of selecting contact part vertices from the SMPL-X body mesh according to the annotation in prior work. We design the smoothness objective to smooth the motion over time by minimizing the velocity difference of consecutive frames, which is formulated as

where $L$ is the length of the motion sequence. We empirically set $\alpha_{1}=1.0$ , $\alpha_{2}=0.02$ , and $\alpha_{3}=0.001$ .
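The weighted combination of the three terms can be illustrated with a minimal Python sketch. The exact definitions are the ones given by the equations of this appendix; the concrete forms below (summed penetration depth, mean nearest-point distance, squared difference of consecutive velocities) are assumptions made for illustration, and the terms are written as costs to be minimized. All function names are hypothetical, and the `sdf` callable stands in for a scene SDF that is assumed to be positive in free space and negative inside obstacles.

```python
import math

def collision_term(vertices, sdf):
    """Sum of penetration depths; only vertices with negative signed distance count."""
    return sum(max(0.0, -sdf(v)) for v in vertices)

def contact_term(contact_vertices, scene_points):
    """Mean distance from each contact vertex to its nearest scene point."""
    nearest = [min(math.dist(v, p) for p in scene_points) for v in contact_vertices]
    return sum(nearest) / len(nearest)

def smoothness_term(frames):
    """Sum of squared differences between consecutive frame-to-frame velocities."""
    total = 0.0
    for prev, cur, nxt in zip(frames, frames[1:], frames[2:]):
        for a, b, c in zip(prev, cur, nxt):
            total += ((c - b) - (b - a)) ** 2
    return total

def pose_motion_objective(vertices, contact_vertices, scene_points, frames, sdf,
                          alphas=(1.0, 0.02, 0.001)):
    """Weighted sum of the three terms, using the weights reported in the text."""
    a1, a2, a3 = alphas
    return (a1 * collision_term(vertices, sdf)
            + a2 * contact_term(contact_vertices, scene_points)
            + a3 * smoothness_term(frames))

# Toy scene (assumption): a floor at z = 0, so the signed distance is the z coordinate.
floor_sdf = lambda v: v[2]
body = [(0.0, 0.0, 0.5), (0.0, 0.0, -0.1)]      # one vertex is 0.1 below the floor
feet = [(0.0, 0.0, 0.3)]                         # hypothetical contact vertices
scene = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
motion = [(0.0,), (1.0,), (3.0,)]                # velocity changes from 1 to 2
cost = pose_motion_objective(body, feet, scene, motion, floor_sdf)
```

With the reported weights, the collision term dominates the total, which is consistent with $\alpha_{1}$ being much larger than $\alpha_{2}$ and $\alpha_{3}$.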

For dexterous grasp generation, we penalize collisions between the robotic hand mesh and the object mesh. We design the optimization objective as $\varphi_{o}(\boldsymbol{\tau}^{t}|\mathcal{S})=\varphi_{o}^{\text{collision}}$. $\varphi_{o}^{\text{collision}}$ is similar to Eq. A1, where the 3D scene is replaced by the object and $\mathcal{M}^{t}$ is the robotic hand mesh at denoising step $t$.

For path planning in the 3D scene navigation task, we design an optimization objective $\varphi_{o}$ and a planning objective $\varphi_{p}$ for generating collision-free paths toward the goals. The collision-free objective maximizes the distance between the robot and the scene vertices in the robot cylinder, formulated as

where $\text{ReLU}(x)=\text{max}(0,x)$, $r$ is the radius of the robot cylinder, and $\text{dist}(\cdot)$ computes the Euclidean distance between the scene vertices and the robot position on the 2D plane. The planning objective $\varphi_{p}$ encourages the generated paths toward the target position. In our work, we formulate it as

For robot arm motion planning, we design the planning objective $\varphi_{p}$ similar to Eq. A5. The objective is defined as

where $N$ denotes the number of revolute joints and $j$ refers to the $j$-th revolute joint.
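To make the collision-free navigation objective described above more concrete, the following minimal Python sketch accumulates $\text{ReLU}(r-\text{dist}(\cdot))$ over all path positions and scene vertices on the 2D plane. It is written as a penalty to be minimized, which is an assumed sign convention, and the function name `collision_penalty` and the toy coordinates are hypothetical.

```python
import math

def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def collision_penalty(path_xy, scene_xy, r=0.08):
    """Total intrusion of scene vertices into the robot cylinder of radius r.

    A scene vertex contributes only when it lies closer than r to a path
    position on the 2D plane; a path that stays clear yields zero.
    """
    return sum(relu(r - math.dist(p, v)) for p in path_xy for v in scene_xy)

# Toy example (assumption): one vertex sits 0.05 m from the first path position.
path = [(0.0, 0.0), (1.0, 0.0)]
scene_vertices = [(0.05, 0.0), (2.0, 0.0)]
penalty = collision_penalty(path, scene_vertices)
```

Because of the ReLU, vertices farther than $r$ from the robot do not influence the objective, so the guidance only acts near obstacles.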

Appendix D Implementation Details

To train SceneDiffuser, we use the Adam optimizer with a learning rate of 0.0001. We train for 100 epochs on 4 NVIDIA A100 GPUs with a batch size of 128. The number of diffusion steps $T$ in this task is set to 100. For optimization guidance sampling, we empirically set the scale coefficient to $\lambda=2.5$.

D.2 Human Motion Generation in 3D Scenes

For the two different settings (with and without start position) of human motion generation in 3D scenes, we represent each single-frame human body in the motion sequence in the same way as in pose generation. To collect training data, we clip the motion sequences in the PROX dataset into motion segments with a fixed duration, i.e., 60 frames. We use the same evaluation metrics as in pose generation and report their average values over the motion sequence as the motion generation performance measure. In this task, the optimizer is Adam, and the learning rate is 0.0001. We train for 300 epochs on 4 NVIDIA A100 GPUs with 200 diffusion steps and a batch size of 128. For optimization guidance sampling, we empirically set the scale coefficient to $\lambda=2.5$.
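The clipping of motion sequences into fixed 60-frame segments can be sketched as follows. Whether consecutive segments overlap is not stated here, so the `stride` parameter is an assumption introduced for illustration, as is the function name `clip_segments`; frames are represented by integers in place of per-frame body parameters.

```python
def clip_segments(sequence, length=60, stride=60):
    """Split a motion sequence into fixed-length segments.

    Trailing frames that do not fill a whole segment are dropped.
    stride == length gives non-overlapping segments; a smaller stride
    gives overlapping ones (an assumed option, not confirmed by the text).
    """
    last_start = len(sequence) - length
    return [sequence[s:s + length] for s in range(0, last_start + 1, stride)]

# Toy sequence (assumption): 150 frames, each represented by its index.
frames = list(range(150))
non_overlapping = clip_segments(frames)            # starts at frames 0 and 60
overlapping = clip_segments(frames, stride=30)     # starts at frames 0, 30, 60, 90
```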

D.3 Dexterous Grasp Generation for 3D Objects

To train SceneDiffuser on this task, we use the Adam optimizer with a learning rate of 0.0001 and train for 2100 epochs on 1 NVIDIA A100 GPU with a batch size of 64. For optimization guidance sampling, we empirically set the scale coefficient to $\lambda=1.0$.

D.4 Path Planning for 3D Scene Navigation

In this task, we consider 3D navigation in realistic scenes, where the goal is to plan plausible trajectories for a physical robot from the given start position $\hat{\mathbf{s}}_{0}$ to the given target position $\mathcal{G}$ in a furnished 3D indoor scene $\mathcal{S}$. We represent the hallucinated physical robot as a cylinder to simulate physically plausible trajectories that are collision-free in the 3D scene. In each step, the robot can move in any direction up to a maximum distance, without height change. We set the maximum moving distance to 0.08m, the robot radius to 0.08m, and the robot height to infinity, which means the robot can only move over floor that is not occupied.

To construct room-level realistic scenarios for path planning, we manually select 61 indoor scenes from ScanNet , as shown in Fig. A1. We annotate these scenes with spatially dense and physically plausible navigation graphs and collect about 6.3k trajectories by searching for the shortest paths between randomly selected start and target graph nodes. As the distance between nodes may be too long for the robot to cover in one step, we refine the trajectories according to the maximum moving distance. These trajectories have an average length of 60.0 steps, a minimum of 32 steps, and a maximum of 120 steps. We use 4.7k trajectories in 46 scenes as the training data and the remaining 1.6k trajectories in 15 unseen scenes for evaluation. We set the maximum number of planning steps to 150.
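One simple way to refine a graph path so that no step exceeds the maximum moving distance is to subdivide each edge evenly. The sketch below does this with linear interpolation; the refinement procedure actually used is not specified in the text, so both the interpolation scheme and the function name `refine_path` are assumptions.

```python
import math

def refine_path(waypoints, max_step=0.08):
    """Subdivide each edge so that every step is at most max_step long."""
    refined = [tuple(waypoints[0])]
    for a, b in zip(waypoints, waypoints[1:]):
        # Number of equal sub-steps needed to cover this edge.
        n = max(1, math.ceil(math.dist(a, b) / max_step))
        for k in range(1, n + 1):
            t = k / n
            refined.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return refined

# Toy graph path (assumption): two edges of length 0.2 m and 0.05 m.
graph_path = [(0.0, 0.0), (0.2, 0.0), (0.2, 0.05)]
dense_path = refine_path(graph_path)
step_lengths = [math.dist(p, q) for p, q in zip(dense_path, dense_path[1:])]
```

In this toy example the 0.2 m edge is split into three sub-steps, while the 0.05 m edge is kept as a single step.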

During training, we set the fixed trajectory horizon as 32. We use 4 NVIDIA A100 GPUs to train 50 epochs with 100 diffusion steps and a batch size of 128. The optimizer is Adam, and the learning rate is 0.0001. During inference, we empirically set the scale coefficient of optimization guidance as 1.0 and the scale coefficient of planning guidance as 0.2.

D.5 Motion Planning for Robot Arm

We use the Franka Emika arm with seven revolute joints as the robot arm and randomly generate cluttered tabletop scenes with primitives following specific heuristics. For each scene, we position the robot arm at the center of the table and use the MoveIt 2 motion planner to synthesize trajectories constrained by a pair of start and goal poses of the end effector. We collect 19,800 collision-free trajectories over 200 cluttered scenes.

During inference, we execute the planned motions of SceneDiffuser in IsaacGym . We consider planning successful if the robot arm reaches the goal pose within a certain $L_{2}$ distance (e.g., $0.2$) in the space of revolute joints. Since the simulation cannot run indefinitely, we limit the number of simulation steps (e.g., $300$). For the efficiency evaluation, we report the average number of simulation steps.
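The success criterion and step budget described above can be sketched in a few lines of Python. The thresholds follow the example values in the text; the function name `planning_success` is hypothetical, and charging a failed run the full step budget in the efficiency measure is an assumption.

```python
import math

def planning_success(joint_trajectory, goal, threshold=0.2, max_steps=300):
    """Return (success, steps_used) for one executed arm trajectory.

    Success means some joint configuration within the step budget lies
    within `threshold` (L2 distance in joint space) of the goal.
    """
    for step, q in enumerate(joint_trajectory[:max_steps], start=1):
        if math.dist(q, goal) <= threshold:
            return True, step
    return False, max_steps  # assumed: a failure is charged the full budget

# Toy 7-joint rollout (assumption): the arm approaches the goal in three steps.
goal_q = (0.0,) * 7
rollout = [(1.0,) * 7, (0.5,) * 7, (0.05,) * 7]
result = planning_success(rollout, goal_q)
```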

To train SceneDiffuser on this task, we use the Adam optimizer with a learning rate of 0.0001 and train for 200 epochs on 4 NVIDIA A100 GPUs with a batch size of 128. We empirically set the scale coefficient of planning guidance to 0.2 during inference.

D.6 Scaling Factor for the Guidance

Similar to Ho et al. , we notice that the parameter $\Sigma$ in Eq. 9 decreases as the denoising step $t$ decreases, which gradually weakens the guidance during the denoising process. Instead of using a constant scaling factor, we empirically schedule it by dividing it by $\Sigma$. This reformulates Eq. 9 as

Appendix E Additional Ablative Experiments

We ablate different model architectures, including the scene encoder and the noise prediction module in SceneDiffuser; the diffusion steps and the scale coefficient of the optimizer in the dexterous grasp generation task; and the fixed frames and planning objectives in path planning for 3D scene navigation.

As shown in Tab. A1, we study how different scene encoders influence the dexterous grasp generation results. We use PointNet and PointNet++ as the scene encoders to extract the object feature. To further evaluate diversity, we report the mean standard deviation over all revolute joints of the robotic hand qpos. We find that the global feature extracted by PointNet makes it easier for the model to learn a mean pose and thus obtain a higher grasping success rate. In contrast, the local feature extracted by PointNet++ makes the generated grasp poses more diverse.
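The diversity measure mentioned above, the mean standard deviation over all revolute joints, can be computed as in the following minimal sketch. The use of the population standard deviation (`statistics.pstdev`) is an assumption, as the text does not say which estimator is used, and the function name and toy samples are hypothetical.

```python
from statistics import pstdev

def mean_joint_std(qpos_samples):
    """Mean over joints of the per-joint standard deviation across samples.

    qpos_samples is a list of joint-angle vectors, one per generated grasp.
    """
    per_joint = list(zip(*qpos_samples))   # regroup values joint by joint
    return sum(pstdev(values) for values in per_joint) / len(per_joint)

# Toy example (assumption): two grasps of a hand with two revolute joints.
samples = [(0.0, 1.0), (2.0, 1.0)]
diversity = mean_joint_std(samples)
```

A model that collapses to a single mean pose yields a value of zero under this measure, which matches the interpretation of the PointNet result above.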

As shown in Tab. A2, we ablate the module for noise prediction. We compare the designs of cross-attention and self-attention for processing the condition and the input. Cross-attention learns the query from the input $\boldsymbol{\tau}_{t}$ and the key and value from the scene condition $\mathcal{S}$. Self-attention concatenates $\boldsymbol{\tau}_{t}$ with the scene features $\mathcal{S}$ and applies self-attention to the result. In our experiments, we find that with self-attention the model better captures the joint distribution of input and condition. This leads to slightly lower diversity but better generation quality and success rate.
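The structural difference between the two designs can be illustrated with single-head scaled dot-product attention in plain Python. This is a simplified sketch: the learned query, key, and value projections are omitted (treated as identity), there is a single head, and all function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

def cross_attention(traj_tokens, scene_tokens):
    """Queries from the trajectory; keys and values from the scene condition."""
    return attention(traj_tokens, scene_tokens, scene_tokens)

def self_attention_concat(traj_tokens, scene_tokens):
    """Concatenate trajectory and scene tokens, then attend over all of them."""
    tokens = traj_tokens + scene_tokens
    return attention(tokens, tokens, tokens)

# Toy tokens (assumption): one trajectory token and three scene tokens.
traj = [[1.0, 0.0]]
scene = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
cross_out = cross_attention(traj, scene)         # one output per trajectory token
self_out = self_attention_concat(traj, scene)    # one output per concatenated token
```

In the cross-attention variant, trajectory tokens only read from the scene, whereas in the concatenated variant trajectory and scene tokens attend to each other in both directions.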

E.2 Diffusion Steps

We study different diffusion steps $T$ in Tab. A3, where we use PointNet++ as the scene encoder with the cross-attention design. We report the success rate, diversity, and depth collision of the sampling results on the test set under different diffusion steps, ranging from 30 to 1000. $T$ balances diversity and success rate in dexterous grasp generation: $T=30$ leads to the best diversity of generated grasp poses, and $T=1000$ leads to the best overall success rate.

E.3 Scale Coefficient

Across different diffusion steps $T$, we ablate the scale coefficient $\lambda$ of the optimization guidance in dexterous grasp generation in Tab. A3, ranging from $0.0$ (denoted as w/o in the table) to $1.0$. We observe that, in general, $\lambda$ trades off depth collision against grasp success rate. A larger $\lambda$ leads to fewer collisions but simultaneously draws the grasp pose away from the object, which reduces grasp stability and lowers the success rate.

We also ablate the scale coefficient of the planner in path planning for 3D scene navigation, as shown in Tab. A4. Scale coefficients that are too small or too large both lead to a performance drop. A small value cannot provide sufficient guidance, whereas a large value diminishes trajectory diversity through overly strong guidance, preventing the trajectory from escaping obstacles and dead ends.

E.4 Fixed Frames for Planning

Since we formulate the planning algorithm as inpainting, we also ablate the number of fixed frames. In path planning for 3D scene navigation, we train SceneDiffuser with a trajectory length of 32. Therefore, we compare the settings of fixing the first 1, 7, 15, 23, and 31 frames for inpainting during the denoising process. The results in Tab. A4 show that the model achieves the best performance when fixing the first 15 frames.
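The role of the fixed frames can be sketched as follows: after every denoising step, the first frames of the sample are overwritten with the already known frames, so that only the remaining frames are generated. This is a minimal illustration of inpainting-style conditioning under assumptions; `denoise_step` is a hypothetical stand-in for one reverse diffusion step of the trained model, and the toy step used below merely halves every value.

```python
def inpaint_fixed_frames(sample, observed, num_fixed):
    """Overwrite the first num_fixed frames of sample with the observed frames."""
    fixed = [list(f) for f in observed[:num_fixed]]
    free = [list(f) for f in sample[num_fixed:]]
    return fixed + free

def denoise_with_inpainting(x, observed, num_fixed, denoise_step, num_steps):
    """Run num_steps reverse steps, re-imposing the fixed frames after each one."""
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        x = inpaint_fixed_frames(x, observed, num_fixed)
    return x

# Toy setup (assumption): horizon of 4 frames, the first 2 are known.
known = [[1.0, 1.0], [2.0, 2.0]]
noisy = [[8.0, 8.0] for _ in range(4)]
halve = lambda x, t: [[v * 0.5 for v in frame] for frame in x]
planned = denoise_with_inpainting(noisy, known, num_fixed=2,
                                  denoise_step=halve, num_steps=3)
```

Fixing more frames gives the model more context from the executed path but leaves fewer frames free for planning ahead, which is the trade-off examined in this ablation.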

E.5 Planning Objectives

To explore the influence of different planning objectives, we design the following four planning objectives and compare them with Eq. A5.

We compute the L1 distance between the last frame of the denoised trajectory and the target position, i.e.,

We sum the L1 distances between all frames of the denoised trajectory and the target position, i.e.,

Similar to Eq. A5, we only consider the last frame of the denoised trajectory, i.e.,

We compute the L1 distance between the target position and the frame closest to the target, i.e.,

The planning results in Tab. A5 indicate that encouraging all frames of the denoised trajectory to reach the target position outperforms considering only one frame. Besides, directly using the L1 distance tends to perform better than additionally applying the exponential function.
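The three L1-based variants compared above differ only in which frames contribute to the distance. The sketch below writes them as distances to be minimized, which is an assumed sign convention; the exponential variant related to Eq. A5 is omitted because its exact form is not reproduced here. All function names are hypothetical.

```python
def l1(a, b):
    """L1 distance between two positions."""
    return sum(abs(x - y) for x, y in zip(a, b))

def last_frame_l1(trajectory, goal):
    """Distance between the last frame and the target position."""
    return l1(trajectory[-1], goal)

def all_frames_l1(trajectory, goal):
    """Sum of the distances between every frame and the target position."""
    return sum(l1(frame, goal) for frame in trajectory)

def closest_frame_l1(trajectory, goal):
    """Distance between the target position and the frame closest to it."""
    return min(l1(frame, goal) for frame in trajectory)

# Toy trajectory (assumption): it passes near the goal and then overshoots.
toy_path = [(0.0, 0.0), (2.0, 0.5), (4.0, 0.0)]
target = (2.0, 0.0)
values = (last_frame_l1(toy_path, target),
          all_frames_l1(toy_path, target),
          closest_frame_l1(toy_path, target))
```

Only the all-frames variant produces a gradient signal for every frame of the trajectory, which is one plausible reading of why it performs best in Tab. A5.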

Appendix F Trainable Optimization and Planning

As shown in Alg. 1, we can optionally train the optimization and planning process with observed trajectories. To verify its efficacy, we optimize the trainable scaling factor $\lambda$ of the optimization guidance in the pose generation and path planning tasks. Specifically, we use a small MLP to map the timestep embedding of each step to a scalar, i.e., the scaling factor. During training, we only optimize the MLP while keeping the pre-trained diffusion model fixed. We plot the learned scaling factor against the denoising step, from 100 to 1, in Fig. A2. We observe that the scaling factor at the beginning of the denoising process is much smaller than at the end. We speculate that the target signal at the beginning of the denoising process is mostly noise, so a large scaling factor cannot optimize it properly. The decrease of the scaling factor in the last several steps may serve to alleviate excessive guidance and to balance the guidance from other modules, such as the planner.

Appendix G More Qualitative Results

We show more qualitative results in Fig. A3.

Motion Generation in 3D Scenes

We provide more sampled human motions from the same start pose in other scenes, as shown in Fig. A4. Please refer to the supplemental demo video for better visualization with rendered animations.

Path Planning for 3D Scene Navigation

Fig. A5 shows some qualitative results of path planning for 3D scene navigation.

Dexterous Grasp Generation for 3D Objects

We show more qualitative results in Fig. A6. Note that the objects are unseen during training time.

Motion Planning for Robot Arm

We render the planning results into animations for visualization. Please refer to the supplemental demo video for the qualitative results.