Goal-Conditioned Imitation Learning using Score-based Diffusion Policies

Moritz Reuss, Maximilian Li, Xiaogang Jia, Rudolf Lioutikov

I Introduction

Goal-conditioned Behavior Learning aims to train versatile embodied agents that can handle a wide range of daily tasks. A common approach to tackle this challenge is Goal-conditioned Imitation Learning (GCIL). GCIL only requires an offline dataset without additional rewards or expensive environment interactions for training. However, GCIL typically requires a set of predefined tasks and a large number of labeled and segmented expert trajectories for each task, which can be costly and time-consuming to collect. Additionally, it does not generalize well to new scenes and different tasks. Instead of teaching an agent a limited number of predefined goals, Learning from Play (LfP) provides an effective way of collecting task-agnostic, teleoperated, uncurated, freeform datasets. Such datasets consist of rich, meaningful, multimodal interactions with the environment that cover different areas of the state space. Instead of manually labeling the trajectories, LfP pairs random sequences of each trajectory with one or more future states of the respective trajectory, which serve as the goal state. Goal-conditioned policies distill useful, goal-oriented behavior from this collected play interaction data. However, learning from play data remains an open challenge, partially due to the multimodal nature of the demonstrations: the same task can be solved in very different ways, and different tasks can be solved in very similar ways.

Effective behavior learning from these datasets demands policies that maintain such multimodal solutions and that are expressive enough to remain close to the seen state-action distribution of the offline data when executing long-horizon skills. Most prior work addresses this challenge by combining generative models, such as Variational Autoencoders (VAEs) and Generative Pretrained Transformers (GPTs), with additional models and networks to explicitly encode multimodality or hierarchy. However, these methods require supplementary networks or a separation of skill execution and planning within their architecture, because the policy alone is not expressive enough or cannot handle the multimodality of the observed behaviors. Additionally, multiple learning objectives are typically required, e.g., for low- and high-level policies, which introduces additional tuning challenges.

We propose a novel approach, BEhavior Generation using ScOre-based Diffusion models (BESO), which excels at learning goal-conditioned policies solely from reward-free, offline datasets. BESO uses Score-based Diffusion Models (SDMs), a new class of generative models that progressively diffuse data to noise through a forward Stochastic Differential Equation (SDE). By training a neural network, known as the score or denoising model, to approximate the score function, one can reverse the SDE to generate new samples from noise in an iterative sampling process.

We demonstrate several benefits of modeling the goal-conditioned action distribution using a score-based diffusion model. First, we show that the expressiveness of SDMs and their ability to capture multimodal distributions are key for effective conditioned behavior generation. On several challenging goal-conditioned benchmarks, including the conditioned Relay Kitchen and Block-Push environments, BESO consistently outperforms state-of-the-art methods such as C-BeT and Latent Motor Plans. Second, by leveraging Classifier-Free Guidance training of SDMs, BESO effectively learns two policies simultaneously: a goal-dependent policy and a goal-independent policy, which can be used together or independently at test time. Third, our model is easy and stable to train with a single training objective and without additional rewards. This contrasts with other state-of-the-art generative models, such as Implicit Behavior Cloning (IBC), or hierarchical policies. Fourth, SDMs do not restrict the choice of the model architecture as other generative models such as VAEs or energy-based models (EBMs) do. Thus, we apply a novel Transformer architecture augmented with preconditioning to synthesize step-based actions given a sequence of observations and desired goal states. Finally, BESO can diffuse new actions fast. While current diffusion-based policies require $30+$ denoising steps for a single action prediction to achieve good results, BESO outperforms state-of-the-art goal-conditioned policies on challenging GCIL benchmarks while using only $3$ denoising steps. We achieve this by using recent advances in Score-based Diffusion Models, which separate the training and inference processes, and by applying novel numerical solvers designed for fast diffusion inference. To this end, we systematically evaluate the essential components of SDMs for fast and effective step-based action generation.

In summary, our contributions are:

BESO, a new policy representation based on score-based diffusion models for effective goal-conditioned behavior generation from uncurated play data

Use of Classifier-Free Guidance based Diffusion Policy to simultaneously learn a goal-dependent and goal-independent policy from play

Systematic evaluation of key components for fast and efficient action generation using Score-based Diffusion policies combined with extensive experiments and ablation studies

II Related Work

Diffusion Generative Models. Score-based generative models (SGMs) and Denoising Diffusion Probabilistic Models (DDPMs) are two variants of score-based diffusion models (SDMs). These models corrupt a data distribution with increasing Gaussian noise and use neural networks to learn to reverse this corruption, generating new data samples from noise. The two variants have been unified using the stochastic differential equation (SDE) framework. SDEs describe the diffusion process as a time-continuous process instead of using discrete noise levels. BESO follows the SDE formulation proposed by Karras et al. To draw new samples from a diffusion model, the SDE is reversed and discretized over $T$ time steps. The SDE contains a probability flow ODE with the same marginal distributions, which allows for fast sampling. ODE solvers do not add noise during the inference process, which can reduce the number of function evaluations and accelerate sampling. Sampling can be further accelerated using specialized numerical ODE solvers designed for diffusion inference. SDMs have achieved state-of-the-art results in various tasks including image generation, text-based image synthesis, and human motion generation.

Goal-Conditioned Imitation Learning (GCIL). GCIL is a sub-domain of Imitation Learning, where each demonstration is augmented with one or more goal-states that are indicative of the task that the demonstration was provided for. The goal-state contains information that a learning method can leverage to disambiguate demonstrations. Consequently, a goal-conditioned policy, i.e., a policy that includes the goal-state in its condition set, can use a given goal-state to adapt its behavior accordingly. Similarly, goal-states have also extended the domain of reinforcement learning through Goal-Conditioned Reinforcement Learning (GCRL), where the agent is not provided expert demonstrations but reward signals instead. Typically, these reward signals are difficult to define, especially for complex tasks and environments, so providing demonstrations is often a more natural option in such situations. Additionally, the policy rollouts required by GCRL are often expensive in real-world settings. Recent work investigated Goal-Conditioned Offline Reinforcement Learning, which does not require these expensive rollouts during training.

Learning from Play. The goal of Learning from Play (LfP) is to learn goal-specified behavior from a diverse set of unlabeled state-action trajectories. Classical imitation learning datasets typically consist of uni-modal, segmented expert trajectories in a narrow state-space. Play data, on the other hand, is characterized by unsegmented, multimodal trajectories. This makes learning meaningful behaviors more challenging, as the policies need to deal with multiple ways of solving a task, distinguish between similar ways of solving different tasks, and plan over long horizons to reach goals far into the future. Prior work aimed to extract representations from play data for effective downstream policy learning or learned self-supervised representations of skills, referred to as latent plans, using Conditional Variational Autoencoders (CVAEs). Transformer-based architectures have also been studied as a policy class for task-agnostic behavior learning. Another body of work tries to improve LfP by focusing on the data aspect and learning from object-centric interactions instead of randomly sampled sequences.

Generative Models in Policy Learning. Imitation Learning can be formulated as a state-occupancy matching problem, where the goal is to learn a policy that matches the state-occupancy distribution of expert demonstrations. The unknown expert distribution can be approximated with modern generative model architectures. One popular approach is the use of Generative Adversarial Networks (GANs). These methods consist of a generator policy that learns to imitate the observed behavior of the expert and a discriminator that distinguishes between real and fake trajectories. They require extensive rollouts during training, which is not feasible in our setting. Other approaches use CVAEs to learn a latent embedding to represent the underlying skills. Recent work also applied energy-based models as conditional policies for behavior cloning. Normalizing flows have also been proposed as a policy representation.

Diffusion Generative Models in Robotics. Most approaches that apply diffusion models in robotics focus on the discrete DDPM variant. DDPMs have been used in Offline-RL to generate state-action or state-only trajectories using large U-Net architectures. DDPM has also been applied as a policy regularization method in a step-based Offline-RL setting in combination with a learned Q-function. Recently, score-based generative models have been leveraged to synthesize cost functions for grasp pose configurations. In addition, conditional score-based generative models have been proposed to learn the reward function for inverse reinforcement learning. The closest related works to BESO are Diffusion Policy and Diffusion-BC, which both propose the use of a conditional, discrete DDPM as a new policy class for Behavior Cloning. Diffusion-BC synthesizes new actions in $50$ stochastic sampling steps. To improve the performance, Diffusion-BC uses $X$ extra inference steps at the lowest noise level without additional noise. However, this results in even slower action generation. In contrast, BESO leverages the probability flow ODE combined with fast, deterministic samplers and optimized noise levels. Hence, BESO requires significantly fewer function evaluations for every action prediction.

III Problem Formulation and Method

In this section, we describe our approach to goal-conditioned behavior generation using Score-based diffusion models.

The goal of GCIL is to learn a general-purpose goal-conditioned policy from uncurated play data. Given a set of unstructured, task-agnostic trajectories, $\mathcal{T}=\left\{\boldsymbol{\tau}_{k}|\boldsymbol{\tau}_{k}=((\boldsymbol{s}_{n}^{k},\boldsymbol{a}_{n}^{k}))_{n=1}^{N_{k}}\right\}$, each trajectory can be split into a set of tuples containing sub-trajectory sequences and goal-states $\mathcal{D}_{k}=\left\{(\boldsymbol{o},\boldsymbol{g})|\boldsymbol{o}=(\boldsymbol{s}_{n},\boldsymbol{a}_{n})_{n=i}^{i_{\dagger}},\boldsymbol{g}=(\boldsymbol{s}_{n})_{n=j}^{j_{\dagger}},(\boldsymbol{s}_{n},\boldsymbol{a}_{n})\in\boldsymbol{\tau}_{k}\right\}$, with $i\leq i_{\dagger}<j\leq j_{\dagger}$ denoting the start and end steps of the sequence and of the goal-state, respectively. As this formulation makes clear, the goal-state has to be one or more states of the same trajectory as the sequence and has to begin at some step after the respective sequence has ended. The set $\mathcal{D}_{k}$ can contain overlapping sequences, and the final play dataset is given as $\mathcal{D}=\bigcup_{k=1}^{K}\mathcal{D}_{k}$. For simplicity, the indices of $\boldsymbol{o}_{k}$ and $\boldsymbol{g}_{k}$ only indicate that the sequence and goal state belong together, and the indices in $(\boldsymbol{s}_{n},\boldsymbol{a}_{n})\in\boldsymbol{o}$ refer to the relative time step in the sequence. The state-action pairs in the sequence $\boldsymbol{o}_{k}$ leading to the goal state $\boldsymbol{g}_{k}$ are now treated as the optimal behavior to reach $\boldsymbol{g}_{k}$. Goal-conditioned policies try to maximize the log-likelihood objective over the play dataset.
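A log-likelihood objective consistent with the notation above can be written as follows. This is a sketch: the symbol $\theta$ for the policy parameters and the name $\mathcal{L}_{\text{play}}$ are assumptions rather than notation taken from the text.

$$
\mathcal{L}_{\text{play}}=\mathbb{E}_{(\boldsymbol{o},\boldsymbol{g})\in\mathcal{D}}\left[\sum_{n=i}^{i_{\dagger}}\log\pi_{\theta}\left(\boldsymbol{a}_{n}|\boldsymbol{s}_{n},\boldsymbol{g}\right)\right]
$$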

Because of the multi-modal nature of the demonstrations, i.e. several trajectories leading to the same goal state, solving this objective successfully requires a policy that is capable of encoding such a multi-modal behavior.
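The pairing of sub-trajectory windows with future goal states from the problem formulation can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the data pipeline used in the experiments: the function name `sample_play_pair` and the parameters `window`, `max_goal_gap`, and `goal_len` are hypothetical.

```python
import random


def sample_play_pair(states, actions, window=4, max_goal_gap=8, goal_len=1, rng=random):
    """Draw one (o, g) tuple from a single play trajectory.

    The indices satisfy i <= i_end < j <= j_end: the goal comes from the same
    trajectory and starts strictly after the observation window has ended.
    All parameter names and default values are illustrative assumptions.
    """
    n = len(states)
    if n < window + goal_len:
        raise ValueError("trajectory too short for the requested window and goal")
    # Latest window start that still leaves room for a goal afterwards.
    i = rng.randint(0, n - window - goal_len)
    i_end = i + window - 1
    # The goal starts after the window, at most max_goal_gap steps ahead.
    j_max = min(i_end + max_goal_gap, n - goal_len)
    j = rng.randint(i_end + 1, j_max)
    j_end = j + goal_len - 1
    o = list(zip(states[i:i_end + 1], actions[i:i_end + 1]))
    g = states[j:j_end + 1]
    return o, g, (i, i_end, j, j_end)
```

Because windows are drawn at random, the same trajectory yields many overlapping tuples, which matches the remark that $\mathcal{D}_{k}$ can contain overlapping sequences.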

III-B Score-based Diffusion Policies

We now aim to learn the policy distribution $\pi_{\mathcal{D}}\left(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g}\right)$ underlying the play dataset $\mathcal{D}$ and, hence, the given demonstrations. We do so by defining a continuous diffusion process, which maps samples from our play dataset by gradually adding Gaussian noise to the intermediate distributions $p_{t},t\in[0,T]$ with initial distribution $p_{0}=\pi_{\mathcal{D}}$ and final distribution $p_{T}$.

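The continuous diffusion process can be described using a stochastic differential equation (SDE). In this work, we define the SDE analogously to a recently introduced formulation:

$$
\mathrm{d}\boldsymbol{a}=\left(\beta(t)\sigma_{t}-\dot{\sigma}_{t}\right)\sigma_{t}\nabla_{\boldsymbol{a}}\log p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})\,\mathrm{d}t+\sqrt{2\beta(t)}\,\sigma_{t}\,\mathrm{d}\omega_{t}\tag{2}
$$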

where $\nabla_{\boldsymbol{a}}\log p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$ refers to the score function and $\omega_{t}$ is the standard Wiener process, which can be understood as infinitesimal Gaussian noise. The noise scheduler is denoted by $\sigma_{t}$, and $\beta(t)$ describes the relative rate at which the current noise is replaced by new noise. In our approach, we adopt $\sigma_{t}=t$, a choice that has proven effective in image generation. At every timestep $t$ and related noise level, there exists a corresponding marginal distribution $p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$, which is the result of injecting Gaussian noise into samples from $p_{\text{play}}$. This can be expressed as $p_{t}(\boldsymbol{a}_{t}|\boldsymbol{a})=\mathcal{N}(\boldsymbol{a},\sigma_{t}^{2}\mathbf{I})$. The final action distribution of the diffusion process is a known tractable prior distribution, $\boldsymbol{a}_{T}\sim p_{T}$. We choose an unstructured Gaussian distribution $p_{T}=\mathcal{N}(\mathbf{0},\sigma_{T}^{2}\mathbf{I})$, which carries no information about the play data distribution.
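The perturbation kernel and the prior described above are simple to simulate. The following minimal sketch assumes NumPy; the function names and the example action are hypothetical and only illustrate how a clean action is corrupted at a small and a large noise level.

```python
import numpy as np


def sigma_schedule(t):
    """Noise schedule sigma_t = t, as adopted in the text."""
    return t


def perturb_action(a, t, rng):
    """Sample a_t ~ N(a, sigma_t^2 I) for a clean action a."""
    sigma = sigma_schedule(t)
    return a + sigma * rng.standard_normal(a.shape)


def sample_prior(shape, sigma_T, rng):
    """Sample a_T ~ N(0, sigma_T^2 I), the data-independent prior."""
    return sigma_T * rng.standard_normal(shape)


rng = np.random.default_rng(0)
a = np.array([0.3, -0.1])               # hypothetical 2-D action
a_small = perturb_action(a, 0.01, rng)  # stays close to the clean action
a_large = perturb_action(a, 40.0, rng)  # dominated by noise
```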

In the case of BESO, we are particularly interested in the Probability Flow Ordinary Differential Equation (ODE) within the SDE. This ODE shares the same marginal distributions $p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$ as the SDE at every timestep, but without the additional random noise injections. By setting $\beta(t)=0$, we recover the Probability Flow ODE from Eq. (2):

$$
\mathrm{d}\boldsymbol{a}=-\dot{\sigma}_{t}\sigma_{t}\nabla_{\boldsymbol{a}}\log p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})\,\mathrm{d}t\tag{3}
$$

The negative score function $-\nabla_{\boldsymbol{a}}\log p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$ specifies the vector field of the current marginal distribution $p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$. This vector field points towards regions of low data density and is scaled with the product of the current noise level $\sigma_{t}$ and its rate of change $\dot{\sigma}_{t}$.

III-C Diffusion Training

In order to generate new samples by numerically approximating the reverse ODE, we require an accurate estimate of the score function $\nabla_{\boldsymbol{a}}\log p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$ for all marginal distributions $p_{t}$ in our diffusion process. To achieve this, we use a neural network $D_{\theta}(\boldsymbol{a},\boldsymbol{s},\boldsymbol{g},\sigma_{t})$ that matches the score for all marginal distributions $p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$ .
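For Gaussian perturbations, a denoiser and the score are linked by a standard identity, which shows how $D_{\theta}$ can stand in for the score in the Probability Flow ODE. The relation below is a sketch that assumes $D_{\theta}$ is trained to predict the clean action from a noisy one:

$$
\nabla_{\boldsymbol{a}}\log p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})\approx\frac{D_{\theta}(\boldsymbol{a},\boldsymbol{s},\boldsymbol{g},\sigma_{t})-\boldsymbol{a}}{\sigma_{t}^{2}},\qquad\frac{\mathrm{d}\boldsymbol{a}}{\mathrm{d}\sigma_{t}}=-\sigma_{t}\nabla_{\boldsymbol{a}}\log p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})\approx\frac{\boldsymbol{a}-D_{\theta}(\boldsymbol{a},\boldsymbol{s},\boldsymbol{g},\sigma_{t})}{\sigma_{t}}
$$

The second expression rewrites the ODE with the noise level as the integration variable, which is the form that numerical samplers typically step through.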

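The neural network is trained using the denoising score matching objective, where we add Gaussian noise to the actions and minimize the difference between the network’s output and the original actions:

$$
\mathcal{L}=\mathbb{E}_{\sigma_{t},\boldsymbol{a},\boldsymbol{\epsilon}}\left[\alpha(\sigma_{t})\left\|D_{\theta}(\boldsymbol{a}+\boldsymbol{\epsilon},\boldsymbol{s},\boldsymbol{g},\sigma_{t})-\boldsymbol{a}\right\|_{2}^{2}\right]
$$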

where $\boldsymbol{a}$ is an action sample and $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma_{t}^{2}\mathbf{I})$ represents the Gaussian noise. The losses at individual noise levels are weighted according to $\alpha(\sigma_{t})$, and the current $\sigma_{t}$ is sampled from the noise training distribution $p_{\text{train}}$. We use a truncated log-logistic distribution with location parameter $\alpha$ and scale parameter $\beta$ (not to be confused with the loss weighting $\alpha(\sigma_{t})$ and the rate $\beta(t)$), $p_{\text{train}}=\text{LogLogistic}(\alpha,\beta)$, restricted to the range $[\sigma_{\text{min}},\sigma_{\text{max}}]$. The training process is summarized in Alg. 1. This allows us to effectively learn the noise-conditioned score function for our diffusion process and generate samples from the conditional density, $p_{t}(\boldsymbol{a}|\boldsymbol{s},\boldsymbol{g})$, using the Probability Flow ODE.
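The training step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the training code of Alg. 1: `denoiser` and `weight_fn` are hypothetical callables standing in for $D_{\theta}$ and $\alpha(\sigma_{t})$, and the log-logistic distribution is assumed to mean that $\log\sigma_{t}$ follows a logistic distribution with the given location and scale.

```python
import numpy as np


def sample_sigmas(n, loc, scale, sigma_min, sigma_max, rng):
    """Truncated log-logistic samples (assumed form): log(sigma) ~ Logistic(loc, scale),
    restricted to [sigma_min, sigma_max] by sampling the CDF range uniformly."""
    def cdf(sigma):
        return 1.0 / (1.0 + np.exp(-(np.log(sigma) - loc) / scale))

    u = rng.uniform(cdf(sigma_min), cdf(sigma_max), size=n)
    return np.exp(loc + scale * np.log(u / (1.0 - u)))


def dsm_loss(denoiser, a, s, g, sigmas, weight_fn, rng):
    """Monte-Carlo estimate of E[ alpha(sigma) * ||D(a + eps, s, g, sigma) - a||^2 ]."""
    eps = sigmas[:, None] * rng.standard_normal(a.shape)  # eps ~ N(0, sigma^2 I)
    pred = denoiser(a + eps, s, g, sigmas)
    per_sample = np.sum((pred - a) ** 2, axis=-1)
    return np.mean(weight_fn(sigmas) * per_sample)
```

A denoiser that always returns the clean action attains zero loss, while one that simply returns its noisy input is penalized in proportion to the injected noise.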

III-D Efficient Action Generation using Deterministic Samplers

New actions are generated by sampling from the prior distribution $\boldsymbol{a}_{T}\sim\mathcal{N}(\mathbf{0},\sigma_{T}^{2}\mathbf{I})$ and numerically simulating the reverse ODE or SDE, substituting the score function in Eq. (3) with our learned model. The process begins by drawing a random sample from the prior and then iteratively denoising it. Using a random sample as a starting point enables the creation of diverse and multimodal actions, even when the underlying ODE is deterministic. The ODE can be solved numerically by discretizing the differential equation from $T$ to $0$. During action prediction, we iteratively denoise the sample at $N$ discrete noise levels. BESO employs the DDIM solver, described in detail in Alg. 2, for fast, deterministic sampling. The solver is a first-order deterministic sampler based on an exponential integrator method. A detailed comparison of state-of-the-art diffusion samplers is provided in Sec. -B of the Appendix, which concludes that DDIM has the best overall performance. An additional evaluation of the influence of noise concludes that ODE solvers are competitive with SDE variants for action prediction tasks. Our ablation studies in Sec. -B suggest that only three denoising steps are necessary for BESO to generate actions with high accuracy. Increasing the number of inference steps further only marginally enhances the performance while significantly slowing down the sampling process. Thus, we found that $3$ steps strike the best balance between computational efficiency and performance. For inference, we can adapt the range of noise and the distribution of discrete timesteps. Based on empirical evaluations, we use exponential time steps with a noise range of $\sigma\in[0.005,1]$ for most applications.
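The sampling loop described above can be sketched in a few lines. The example is a hedged illustration, not a reproduction of Alg. 2: it assumes a denoiser that predicts the clean action, uses a first-order exponential-integrator update of the DDIM type in the $\sigma$ parameterization, and the names `exponential_sigmas` and `sample_action` are hypothetical.

```python
import numpy as np


def exponential_sigmas(n_steps, sigma_min=0.005, sigma_max=1.0):
    """n_steps noise levels evenly spaced in log-space, followed by a final 0."""
    sigmas = np.exp(np.linspace(np.log(sigma_max), np.log(sigma_min), n_steps))
    return np.append(sigmas, 0.0)


def sample_action(denoiser, s, g, action_dim, n_steps=3, rng=None):
    """Deterministic first-order sampler for the probability flow ODE (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    sigmas = exponential_sigmas(n_steps)
    # Start from the prior: a_T ~ N(0, sigma_T^2 I).
    a = sigmas[0] * rng.standard_normal(action_dim)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(a, s, g, sigma)
        ratio = sigma_next / sigma
        # Exponential-integrator update; it coincides with an Euler step on
        # da/dsigma = (a - D(a)) / sigma.
        a = ratio * a + (1.0 - ratio) * denoised
    return a
```

With `n_steps=3` the denoiser is evaluated three times per action. The final step moves to $\sigma=0$, so the returned action equals the last denoiser output.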

IV Goal-Guided Score-based Diffusion Policies

In this section, we introduce two variants of BESO optimized for synthesizing actions for goal-conditioned behavior.

Conditioned Policy (C-BESO). We define a goal-conditioned diffusion policy, $\pi\left(\boldsymbol{a}\middle|\boldsymbol{s},\boldsymbol{g}\right)$ , by directly learning the goal-and-state-conditioned distribution with our score-based generative model. In contrast to standard goal-conditioned behavior cloning, our diffusion policy allows us to capture multiple solutions present in the play data while still being expressive enough to solve long-term goals.

Goal-Classifier-Free Guided Policy (CFG-BESO). We additionally combine BESO with a popular conditioning method for diffusion models, Classifier-Free Guidance (CFG). We train a goal-conditioned diffusion policy $\pi\left(\boldsymbol{a}\middle|\boldsymbol{s},\boldsymbol{g}\right)$ by applying a dropout rate of $0.1$ to the goal $\boldsymbol{g}$, which also trains an implicit goal-independent policy $\pi\left(\boldsymbol{a}\middle|\boldsymbol{s}\right)$ within our goal-conditioned model. The generation process uses a combined gradient for the denoising process

where the guidance factor $\lambda$ balances the influence of the goal-conditioned and goal-independent gradients. In the diffusion literature, $\lambda$ commonly ranges from $2$ to $7.5$ to guide the diffusion model towards the goal-conditional distribution $\pi\left(\boldsymbol{a}\middle|\boldsymbol{s},\boldsymbol{g}\right)$. CFG has demonstrated significant performance improvements compared to other conditioning methods. Even though CFG has also been successfully applied for generating state-only trajectories in Offline-RL, recent work on behavioral cloning suggests that CFG performs significantly worse than simpler conditioning methods for step-based action generation. We provide a detailed analysis of CFG for goal-guided action generation in our experiment section.
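Goal dropout during training and the guided combination at inference can be sketched as follows. The combination rule shown is the commonly used classifier-free guidance form and is an assumption here, as are the function names and the choice of an all-zero vector to represent a dropped goal. The rule is applied to denoiser outputs, which carries over to scores because the score is affine in the denoiser output.

```python
import numpy as np


def mask_goals(g, drop_prob, rng):
    """Training-time goal dropout: zero out each goal with probability drop_prob."""
    keep = rng.uniform(size=(g.shape[0], 1)) >= drop_prob
    return g * keep


def cfg_denoise(denoiser, a, s, g, sigma, lam):
    """Blend goal-conditioned and goal-independent predictions with guidance factor lam."""
    cond = denoiser(a, s, g, sigma)
    # A zeroed goal plays the role of "no goal" (assumption).
    uncond = denoiser(a, s, np.zeros_like(g), sigma)
    return uncond + lam * (cond - uncond)
```

Setting `lam=1` recovers the purely goal-conditioned prediction and `lam=0` the goal-independent one; values above 1 extrapolate away from the goal-independent prediction.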

One of the main challenges of training the score-based diffusion model is the large range of noise levels, $\sigma_{t}\in[0.001,40]$. To address this challenge, we use an improved architecture including additional skip-connections and two pre-conditioning layers, which are conditioned on the current noise level $\sigma_{t}$.

The conditioning functions are described in detail in Section III of the Appendix and visualized in Figure 1.
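A noise-dependent skip connection of this kind can be sketched as below. The coefficients follow the preconditioning proposed by Karras et al. and are an assumption here, since the conditioning functions themselves are given in the Appendix; `sigma_data` is a hypothetical value and `inner_model` is a placeholder for $F_{\theta}$.

```python
import math


def edm_preconditioning(sigma, sigma_data=0.5):
    """Scaling coefficients in the style of Karras et al. (assumed, not from the text)."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = 0.25 * math.log(sigma)
    return c_skip, c_out, c_in, c_noise


def preconditioned_denoiser(inner_model, a, s, g, sigma, sigma_data=0.5):
    """D(a, s, g, sigma) = c_skip * a + c_out * F(c_in * a, s, g, c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma, sigma_data)
    return c_skip * a + c_out * inner_model(c_in * a, s, g, c_noise)
```

At small noise levels `c_skip` is close to 1, so the network only has to predict a small correction to its input. At large noise levels `c_skip` approaches 0 and the network output determines the denoised action almost entirely.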

These additional skip connections help the score model to scale the output to a wide range of noise levels $\sigma_{t}$, either by estimating the denoised sample $\boldsymbol{a}_{t-1}$, directly predicting the noise $\boldsymbol{\epsilon}$, or something in between these two. BESO uses a Transformer-based architecture with causal masking as the inner model $F_{\theta}(\boldsymbol{a},\boldsymbol{s},\boldsymbol{g},\sigma_{t})$. This enables our model to learn temporal relations between observations and actions, thereby improving its overall performance. A detailed overview of our proposed architecture is shown in Figure 1. Three linear embedding layers encode the states $\boldsymbol{s}_{n}$, noise $\sigma_{t}$, and the noisy actions $\boldsymbol{a}_{n}$ into a linear representation of the same dimension, $l_{\boldsymbol{s}}(\boldsymbol{s}),l_{\boldsymbol{a}}(\boldsymbol{a}),l_{\sigma}(\sigma)$. In addition, position embeddings are added to the linear representations. The noise embedding is concatenated with the desired future states and all state-noise-action pairs into one long sequence for the model. During training, the denoised actions are inferred for all timesteps in the input series, yet only the last predicted action is used at inference. To take advantage of the causal masking in the transformer, we concatenate the goal sequence before the current observation sequence, which also allows for a sequence of goal-states.
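The reason for placing the goal sequence first can be seen from the causal attention mask alone. The sketch below is illustrative: the token counts are hypothetical and the layout is simplified to goal tokens followed by observation tokens.

```python
import numpy as np


def build_sequence_mask(n_goal, n_obs):
    """Causal attention mask for the layout [goal tokens | observation tokens].

    mask[q, k] is True when query position q may attend to key position k.
    """
    total = n_goal + n_obs
    return np.tril(np.ones((total, total), dtype=bool))


mask = build_sequence_mask(n_goal=2, n_obs=3)
# Every observation token (rows 2..4) can attend to every goal token (columns 0..1).
goal_visible = mask[2:, :2].all()
# If the goals were appended after the observations instead, the earliest
# observation token (row 0) could not attend to them (columns 3..4).
goal_hidden_if_appended = not mask[0, 3:].any()
```

Under a causal mask, a prefix is visible to all later positions, so prepending the goals conditions every action prediction on them without changing the attention mechanism.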

V Evaluation

The objective of our experiments was to answer the following key questions: I) Is BESO competitive on goal-conditioned environments against state-of-the-art baselines? II) What are the key components to enable fast sampling of Diffusion policies with good performance? III) Does Classifier-Free Guidance work for goal-conditional behavior synthesis? To answer these questions, we evaluated BESO on several challenging simulation benchmarks. First, we compared the performance of BESO against other state-of-the-art methods. Afterward, we examined BESO’s components with respect to their contribution to the performance.

We compare BESO against several state-of-the-art methods:

Goal-conditioned Behavior Cloning (GCBC) learns a unimodal policy encoded as a simple multi-layer perceptron (MLP) trained with an MSE loss.

Relay Imitation Learning (RIL) is a hierarchical policy that learns a high-level sub-goal generator, which is used to condition a low-level MLP policy.

Latent Motor Plans (LMP) is a hierarchical goal-conditioned policy, which consists of a seq2seq CVAE and an action decoder policy. We use an adapted KL-weighting term and a transformer encoder, which have been shown to improve the performance of LMP.

Conditional Implicit Behavior Cloning (C-IBC) uses an energy-based model as an implicit policy. We use a goal-conditioned extension of IBC to study the importance of the selected generative model architecture.

Conditional-Behavior Transformer (C-BeT) is a GPT-like transformer-based policy that predicts discrete action labels together with a continuous offset vector to learn multimodal behavior. The action labels are determined a priori via K-means clustering.

Diffusion-X (CX-Diff) is a DDPM-based policy with improved inference. It uses stochastic sampling and $X$ extra inference steps at the lowest noise level to synthesize actions in $50+X$ steps. While performing only slightly worse than the closely related KDE-Diff, it has a significantly lower computational cost.

To ensure a fair evaluation of all methods we kept the general hyperparameters, e.g., layer size and number, as consistent as possible while tuning the method-specific hyperparameters. A detailed summary of the baseline architectures and hyperparameters is provided in Sec. -C of the Appendix. Additionally, we evaluated all models on the kitchen and block-push task with 10 seeds and 100 rollouts each. Given the high computational costs and time of training models for CALVIN, we restricted the tested methods to 3 seeds and limited the number of baselines.

V-B Simulation Experiments

We evaluated BESO against the baselines on three simulation benchmarks, shown in Figure 2:

CALVIN Benchmark: We used the LfP benchmark, with a dataset consisting of 6 hours of unstructured play data. We restricted all methods to using a single static RGB image as observation input and predicting relative Cartesian actions as output. We evaluated the methods on single tasks and on 2 tasks in a row from a single goal image. Both variants were conditioned on goal images outside the training distribution that did not contain the end-effector in the correct position.

Block-Push Environment: We used the adapted goal-conditioned variant. The Block-Push Environment consists of an XArm robot that must push two blocks, a red and a green one, into a red and a green square target area. The dataset consists of 1000 demonstrations collected by a deterministic controller with 4 possible goal configurations. The methods received 0.5 credit for every block pushed into one of the targets, with a maximum score of 1.0.

Relay Kitchen Environment: A multi-task kitchen environment with objects such as a kettle, door, and lights that the agent can interact with. The data consists of 566 human-collected trajectories with sequences of 4 executed skills. We used the same experiment settings as described in prior work to allow for fair comparisons. The models were evaluated using a pre-defined goal state that consisted of 4 tasks for each rollout. Each correctly completed task gives 1 credit, with a maximum of 4.

The methods were evaluated on two metrics: result evaluates how many of the desired goals of each rollout are achieved, while reward measures the overall performance by giving credit for reaching any goal defined in the environment.
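The difference between the two metrics can be illustrated with a small scoring function. This is one plausible reading of the definitions above with hypothetical task names; the actual evaluation code may differ, for example in how scores are capped.

```python
def result_and_reward(completed, desired, credit=1.0):
    """Score a single rollout (illustrative sketch).

    result: credit for completed tasks that were part of the desired goal.
    reward: credit for any completed task defined in the environment.
    """
    completed, desired = set(completed), set(desired)
    result = credit * len(completed & desired)
    reward = credit * len(completed)
    return result, reward


# Hypothetical kitchen rollout: two desired tasks solved plus one undesired task.
desired = ["kettle", "door", "lights", "task_d"]
completed = ["kettle", "lights", "task_e"]
result, reward = result_and_reward(completed, desired)
```

A policy that ignores the goal but still interacts meaningfully with the environment can therefore obtain a high reward while its result stays low.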

V-C Simulation Results

We compared BESO to the baselines on the Relay-Kitchen and Block-Push environments. The results are summarized in Table I. As shown in the table, BESO consistently outperformed the competitors on both tasks across 10 seeds. The low variance of BESO additionally indicates the robustness of our approach. Among the baselines, Diffusion-X and C-BeT perform well on the kitchen task and block-push environment, respectively. The diffusion policies outperformed all other baselines on the kitchen and the block-push task, whereas C-BeT demonstrated comparable performance on the block-push environment. BESO’s performance is even more notable considering that it only used $3$ denoising steps on both environments, compared to the $50(20)+8$ steps of CX-Diff. By contrast, CX-Diff, when limited to $3$ denoising steps, only managed an average result of $2.74(\pm 0.26)$ in the kitchen environment. This highlights the advantage of BESO’s architecture combined with improved noise scheduling and sampler to achieve good results with only $3$ denoising steps. On a modern desktop PC, BESO requires around $0.012$ seconds to predict an action, while CX-Diff needs an average of $0.15$ seconds. This makes BESO over $10$ times faster ($0.15/0.012=12.5$).

In a more challenging simulation environment, the CALVIN environment, BESO demonstrated its ability to generalize to unseen goal states by achieving the best overall performance on 13 difficult single tasks. Each task was conditioned on a single goal image unseen during training, where the end-effector is not located near the corresponding task. This posed a significant challenge, as the models have to infer changes in the environmental state and perform the necessary tasks without relying on the position of the end-effector in the image for guidance. The results of this experiment are summarized in Figure 4 and the individual success rates of the tasks are summarized in Figure 3. As shown, BESO achieves the best overall performance on individual hard tasks, demonstrating its ability to also generalize to unseen goal-states. RIL is the second-best model and has a slightly better average performance on 2 tasks.

Additionally, the models were evaluated on solving two tasks with a single goal image. Similar to the first task, the end-effector was located at a different position away from both tasks. In this instance, BESO and its Classifier-Free Guidance (CFG) variant once again outperformed the other models, though the CFG variant registered a slightly lower performance. The results illustrate that BESO can effectively learn meaningful behavior to solve downstream short-term and long-term goals by learning from random windows of play trajectories. This further supports the conclusion that BESO’s ability to learn multimodal and expressive action distributions is key for effective learning from play. In addition, this experiment showcases BESO’s proficiency in learning effectively from visual data. Overall, our results indicate that BESO is competitive against state-of-the-art baselines and capable of effectively learning from play data, making it a promising approach for goal-conditioned behavior learning. Hence, we can answer Question I) in the affirmative.

V-D BESO design choices

We answer Question II by evaluating different components of BESO to study their contribution to the overall performance.

Conditioning Method. First, we evaluated different methods to condition the behavior generation on the desired goal state. We tested FiLM conditioning and the sequential conditioning method used in C-BeT. FiLM requires additional MLP models, which take the goal as input and scale the latent representations inside the transformer layers. The sequential conditioning method simply includes the desired goal-states at the beginning of our sequence, as depicted in the model overview of Figure 1. We tested both conditioning variants using the same transformer score model and evaluated them on the block-push and kitchen environments with 10 seeds. Compared to sequential conditioning, FiLM conditioning reduced the average result from $0.93$ to $0.91$ on the block-push environment and from $3.76$ to $3.4$ on the kitchen environment. Moreover, the FiLM method increases the overall model capacity. Hence, BESO uses the sequential conditioning method.

Sampling Algorithm. BESO generates actions by numerically approximating the reverse ODE with its learned score model, starting from a sample drawn from our Gaussian prior distribution $p_{T}$. We investigated several numerical sampling algorithms used in diffusion research, such as DDIM, DPM, DPM++, and Heun, to assess their contribution to BESO’s performance. The samplers were evaluated on the block-push and kitchen environments with different numbers of denoising steps. The results show that the performance gap between the individual samplers is small, with DDIM achieving the best overall performance. Surprisingly, the second-order Heun solver has the worst average performance. Detailed results of this experiment are summarized in Table VII and Table VIII in the Appendix. Overall, BESO is robust to the number of sampling steps and the chosen sampler type, maintaining a similar performance from $3$ to $50$ inference steps.

Stochastic vs. Deterministic Sampling. Current diffusion literature supports the assumption that stochastic samplers have a better overall performance compared to deterministic samplers. We tested this assumption with respect to step-based action generation. We evaluated the same models with two sampling algorithms, DPM++(2S) and the Euler sampler, each with and without noise injection. The noise scheduling was performed via the ancestral sampling strategy, as used in the DDPM variant and described in Alg. 3. Experiments were again conducted in all environments. As shown in Table II, the results suggest that the addition of noise does not offer a significant benefit to the action generation of step-based diffusion policies. Stochastic samplers only increase the average performance in the kitchen environment. The discrepancy compared to common diffusion applications such as image synthesis could be rooted in the high dimensionality of image spaces, which makes the generation process more difficult and requires more steps for good results. In these high-dimensional spaces, errors are more likely to occur and accumulate over time. Adding noise during the inference process helps the model to correct errors of the gradient approximation, resulting in a better overall performance. In contrast, step-based action distributions are significantly lower dimensional than the latent spaces of image generation. Hence, the addition of noise does not appear to benefit the average performance of step-based policies, as supported by our experimental results.

Finally, we investigate Question III by evaluating the effect of Classifier-Free Guidance (CFG) for step-based action generation with goal-conditioned policies. The results of this experiment, reported in Figure 5, indicate that CFG is an effective method for goal-conditioning in a step-based setting. The average result for the block-push and kitchen tasks is slightly worse than that of the standard goal-conditioned variant, while the average reward is equal. CFG-BESO is also able to learn effectively in the image-based CALVIN environment and achieves similar performance to the standard goal-conditioned variant. The performance of the CFG model with $\lambda=0$ demonstrates that CFG-BESO is capable of learning a well-performing, unconditional policy $\pi\left(\boldsymbol{a}|\boldsymbol{s}\right)$. The low average result in Figure 5 shows that the policy ignores the goal state and aims to achieve a high reward solely based on the current state. This gives CFG-BESO a unique advantage over common play-based policies. However, CFG has a trade-off: it slightly lowers the average result in exchange for more diverse rollouts. Empirical evaluations suggest the best $\lambda$ value is $1.25$ for most tested environments. Experiments with higher values resulted in a lower average performance in environments with high-dimensional action spaces, indicating instability in the action generation. We hypothesize that the guidance provided by the goal-conditioning is only crucial in certain steps during the rollouts, specifically when the policy is deciding which task to solve.
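The role of the guidance weight can be made concrete with a short sketch. It assumes the linear blend $D_{\text{uncond}}+\lambda\,(D_{\text{cond}}-D_{\text{uncond}})$, chosen because it matches the statement that $\lambda=0$ recovers the unconditional policy; the exact guidance formula used for BESO is not restated in this section. The function names and the toy denoiser are hypothetical.

```python
import numpy as np

def cfg_denoise(denoiser, action, state, goal, sigma, lam):
    """Blend unconditional and goal-conditioned predictions with weight lam.

    lam = 0 gives the unconditional prediction, lam = 1 the conditional one,
    and lam > 1 extrapolates beyond the conditional prediction.
    """
    uncond = denoiser(action, state, None, sigma)  # goal masked out
    cond = denoiser(action, state, goal, sigma)
    return uncond + lam * (cond - uncond)

def toy_denoiser(action, state, goal, sigma):
    """Hypothetical stand-in for a trained model, used only for illustration."""
    target = state if goal is None else goal
    return action + (target - action) / (1.0 + sigma**2)

action = np.array([0.0, 0.0])
state = np.array([0.2, -0.1])
goal = np.array([1.0, 0.5])

uncond_pred = cfg_denoise(toy_denoiser, action, state, goal, sigma=0.5, lam=0.0)
cond_pred = cfg_denoise(toy_denoiser, action, state, goal, sigma=0.5, lam=1.0)
guided_pred = cfg_denoise(toy_denoiser, action, state, goal, sigma=0.5, lam=1.25)
```

Because the two weights $(1-\lambda)$ and $\lambda$ sum to one, blending denoised actions in this way is equivalent to blending the corresponding scores.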

VI Conclusion

We introduced BESO, a new policy representation for goal-conditioned behavior generation that uses score-based diffusion models. We leveraged the expressiveness and multimodal properties of score-based diffusion models to learn task-agnostic behavior from offline, reward-free play datasets, without requiring hierarchical structures or additional clustering. In addition, we demonstrated the effectiveness of Classifier-Free Guidance for simultaneously learning a goal-dependent and goal-independent policy in a sequential setting. Experiments on several GCIL benchmarks showed that BESO significantly improves upon several state-of-the-art GCIL algorithms. Our ablation studies have demonstrated the key components of BESO that enable fast, deterministic behavior generation. It further outperformed standard DDPM policies with only 3 denoising steps, alleviating prior drawbacks of slow diffusion sampling.

While BESO demonstrates great performance as a standalone policy, it also offers the flexibility to be seamlessly integrated into other hierarchical frameworks as an action prediction policy. Serving as a practical alternative to traditional behavior cloning policies, BESO sets itself apart with distinct features that are inherent to diffusion models. In the future, we aim to extend BESO for language-guided behavior generation, offering more intuitive goal guidance for humans.

VII Acknowledgments

The work presented here was funded by the German Research Foundation (DFG) – 448648559.

References

-A BESO Hyperparameters

A summary of the key hyperparameters of BESO is listed in Table III. We observe that transformer-specific hyperparameters such as the dropout rates require tuning according to the task, while general diffusion hyperparameters remain consistent across different tasks.

Preconditioning. We utilize the preconditioning functions proposed by Karras et al.:

$c_{\text{skip}}=\sigma_{\text{data}}^{2}/(\sigma_{\text{data}}^{2}+\sigma_{t}^{2})$

$c_{\text{out}}=\sigma_{t}\sigma_{\text{data}}/\sqrt{\sigma_{\text{data}}^{2}+\sigma_{t}^{2}}$

$c_{\text{in}}=1/\sqrt{\sigma_{\text{data}}^{2}+\sigma_{t}^{2}}$
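As a minimal sketch, the three functions above can be computed and combined as follows. The wrapper $c_{\text{skip}}\,\boldsymbol{a}+c_{\text{out}}\,F(c_{\text{in}}\,\boldsymbol{a},\sigma_{t})$ follows the formulation of Karras et al.; the default `sigma_data=0.5`, the function names, and passing $\sigma_{t}$ to the network without a separate noise-embedding transform are assumptions made for illustration.

```python
import numpy as np

def preconditioning(sigma, sigma_data=0.5):
    """Return (c_skip, c_out, c_in) for noise level sigma."""
    total = sigma_data**2 + sigma**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / np.sqrt(total)
    c_in = 1.0 / np.sqrt(total)
    return c_skip, c_out, c_in

def denoise(network, noisy_action, sigma, sigma_data=0.5):
    """Wrap a raw network F into a denoiser using skip and output scaling."""
    c_skip, c_out, c_in = preconditioning(sigma, sigma_data)
    return c_skip * noisy_action + c_out * network(c_in * noisy_action, sigma)

# Hypothetical network that always predicts zero, to show the skip path alone.
zero_network = lambda scaled_action, sigma: np.zeros_like(scaled_action)

c_skip, c_out, c_in = preconditioning(sigma=0.5)
low_noise = denoise(zero_network, np.array([0.4, -0.7]), sigma=1e-4)
```

At low noise levels the skip connection dominates, so the denoiser output stays close to its input, while at high noise levels the network output carries most of the prediction.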

Normalization. BESO performs optimally when actions are diffused within a range of $ $ with a noise range of $\{0.005,1\}$. We adopted this noise range for all three environments, scaling the action output accordingly. For action diffusion with a larger range, such as $ $, it is advisable to expand the noise range to higher values, $\{0.4,40\}$, for optimal performance. For the input, we recommend normalizing the data to a mean of $0$ and a standard deviation of $1$.

Training Noise distribution. During training, noise values are sampled from a predefined noise distribution $P(\sigma)$. The standard distribution used in diffusion literature is the log-normal, which introduces two additional hyperparameters, $\sigma_{\text{std}}$ and $\sigma_{\text{max}}$, that require additional tuning. Our experiments revealed that the recommended values from prior work are not optimal for action diffusion. Hence, we opted for the log-logistic distribution $\text{LogLogistic}(\alpha=0.5,\beta=0.5)$, which does not require additional parameters and works well in all our experiments.
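Sampling training noise levels from a log-logistic distribution can be sketched via inverse-CDF sampling. The sketch assumes that $\log\sigma$ follows a logistic distribution with location $\log\alpha$ and scale $\beta$, truncated to the noise range $\{0.005,1\}$; this is only one possible reading of $\text{LogLogistic}(\alpha,\beta)$, other parameterizations exist, and the truncation itself is an assumption.

```python
import numpy as np

def sample_log_logistic(n, loc, scale, sigma_min, sigma_max, rng):
    """Draw n noise levels whose log is logistic(loc, scale), truncated to
    [sigma_min, sigma_max] by restricting the uniform variate."""
    def logistic_cdf(log_sigma):
        return 1.0 / (1.0 + np.exp(-(log_sigma - loc) / scale))

    u_low = logistic_cdf(np.log(sigma_min))
    u_high = logistic_cdf(np.log(sigma_max))
    u = rng.uniform(u_low, u_high, size=n)
    # Inverse logistic CDF (logit), then exponentiate to obtain sigma.
    return np.exp(loc + scale * np.log(u / (1.0 - u)))

rng = np.random.default_rng(0)
sigmas = sample_log_logistic(
    10_000,
    loc=np.log(0.5),  # assumed mapping of alpha = 0.5
    scale=0.5,        # assumed mapping of beta = 0.5
    sigma_min=0.005,
    sigma_max=1.0,
    rng=rng,
)
```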

Optimization. For optimization, we employed the commonly used Adam or AdamW optimizer with a standard learning rate of $1e-4$. Additionally, we maintain an Exponential Moving Average (EMA) of the model’s weights.

Time steps. One important choice is the function of time steps, which determines how noise levels are distributed over the discrete steps. Our empirical evaluation summarized in Table IV indicates that exponential time steps are the most effective for BESO on average. However, other discretization methods such as the linear scheduler and Karras scheduler also deliver comparable results and can increase the performance on individual tasks.
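The three discretizations can be sketched as follows. The sketch assumes that "exponential" means noise levels spaced uniformly in $\log\sigma$, that "linear" means uniform spacing in $\sigma$, and that the Karras schedule uses the exponent $\rho=7$ from Karras et al.; the function names are illustrative.

```python
import numpy as np

def exponential_steps(n, sigma_min, sigma_max):
    """Noise levels spaced uniformly in log-space (geometric progression)."""
    return np.exp(np.linspace(np.log(sigma_max), np.log(sigma_min), n))

def linear_steps(n, sigma_min, sigma_max):
    """Noise levels spaced uniformly in sigma."""
    return np.linspace(sigma_max, sigma_min, n)

def karras_steps(n, sigma_min, sigma_max, rho=7.0):
    """Noise levels spaced uniformly in sigma**(1/rho), as in Karras et al."""
    ramp = np.linspace(0.0, 1.0, n)
    max_inv_rho = sigma_max ** (1.0 / rho)
    min_inv_rho = sigma_min ** (1.0 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

n_steps, sigma_min, sigma_max = 5, 0.005, 1.0
schedules = {
    "exponential": exponential_steps(n_steps, sigma_min, sigma_max),
    "linear": linear_steps(n_steps, sigma_min, sigma_max),
    "karras": karras_steps(n_steps, sigma_min, sigma_max),
}
```

All three schedules share the same endpoints and differ only in how densely they cover the low-noise region: the linear schedule takes equal steps in $\sigma$, while the exponential and Karras schedules place more of their steps at small noise levels.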

Recommendations. For a new task, we recommend starting with the noise range of $\{0.005,1\}$ together with exponential time steps and the DDIM solver. To get the best performance, it is worth trying out other samplers, such as Euler Ancestral, and linear time steps.

-B Sampler Ablation

We evaluate various state-of-the-art ODE samplers and their SDE counterparts in different environments. To determine the best solver for conditional behavior generation, we analyze the average performance over 10 different seeds with 100 rollouts each in different environments. In general, we differentiate between first-order and second-order solvers: the tested first-order solver is Euler and the tested second-order solver is Heun. The tested samplers include:

Euler ODE (Euler): A first-order ODE sampler without the additional injection and removal of noise. The algorithm is summarized in Alg. 4.

Euler-Ancestral (EA): A continuous-time version of the standard DDPM sampler introduced in .

2nd Order Heun Solver (Heun): A second-order ODE solver using the Heun method .

DPM: An exponential ODE integrator solver designed for synthesis in a few inference steps . We use the second order method.

DDIM: A first-order variant of DPM, which was introduced independently and has been designed for fast inference and CFG.

DPM-Ancestral: A stochastic variant of DPM with ancestral noise injections.

DPM++(2S): An improved version of the second-order DPM sampler for classifier-free guidance based conditional diffusion models, in its single-step formulation.

DPM++(2M): An improved version of the second-order DPM sampler for classifier-free guidance based conditional diffusion models, in its multistep formulation, which is a second-order method using two model predictions per step.
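To illustrate the difference between a deterministic sampler and its ancestral counterpart, the following sketch implements the Euler ODE and Euler-Ancestral updates in the noise-level parameterization of Karras et al. It is a common formulation and may differ in detail from Alg. 3 and Alg. 4; the constant toy denoiser and the noise schedule are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def euler_sampler(denoiser, x, sigmas):
    """Deterministic first-order solver for the probability-flow ODE."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoiser(x, sigma)) / sigma  # ODE derivative dx/dsigma
        x = x + d * (sigma_next - sigma)
    return x

def euler_ancestral_sampler(denoiser, x, sigmas, rng):
    """Euler step to a lower noise level, followed by fresh noise injection."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Split the target noise level into a deterministic and a random part.
        sigma_up = np.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2)
        sigma_down = np.sqrt(sigma_next**2 - sigma_up**2)
        d = (x - denoiser(x, sigma)) / sigma
        x = x + d * (sigma_down - sigma)
        x = x + sigma_up * rng.normal(size=x.shape)
    return x

# Toy setting: all data sits at one action, so the ideal denoiser is constant.
target_action = np.array([0.3, -0.2])
toy_denoiser = lambda x, sigma: target_action

rng = np.random.default_rng(0)
sigmas = np.array([1.0, 0.3, 0.05, 0.005, 0.0])
x_init = sigmas[0] * rng.normal(size=2)

ode_action = euler_sampler(toy_denoiser, x_init, sigmas)
ancestral_action = euler_ancestral_sampler(toy_denoiser, x_init, sigmas, rng)
```

In this toy setting both samplers end at the target action, because the final step moves to noise level zero where no noise is injected. With a learned denoiser the two variants generally produce different samples.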

Several previous studies have compared the performance of ODE samplers in the context of image generation. However, these comparisons may not be entirely indicative, as image generation tasks have unique challenges and requirements that are not relevant to action synthesis. To ensure a fair comparison, we evaluated all samplers on the same models across several simulation environments and report their average performance based on 100 runs for each environment. This allows us to accurately compare the effectiveness of each deterministic solver in the context of step-based action generation. The results for the kitchen environment are shown in Table VII and the performance for the block-push environment is reported in Table VIII. As shown in both tables, the first-order exponential integrator solver DDIM achieves the best overall performance. Increasing the number of inference steps does not have a significant impact on the average performance, and even reduces the average result of some samplers. Overall, the performance differences between all evaluated samplers are small.

-C Baselines Implementation

The MLP-based models have $4$ layers with $512$ neurons and use the ReLU activation function. All diffusion models have the same transformer backbone, and C-BeT uses its recommended parameters. During training, the Adam optimizer was used with a learning rate of $0.001$ for MLP models and $1e-4$ for transformer models. The batch size for MLP models was $512$ , while it was $1024$ for transformer models, except for BeT, which used a batch size of $64$ as recommended in .

GCBC For the GCBC model, the goal is concatenated with the state and fed into the 4-layer MLP architecture with a dropout rate of 0.1.

GC-IBC The GC-IBC model uses the same MLP architecture as GCBC and is optimized using the InfoNCE loss with additional energy regularization and a Wasserstein gradient loss. During experiments, adding a penalty term with $\lambda=0.005$ to restrict the average energy improved training stability. Given the large number of tunable hyperparameters for IBC, we ran a hyperparameter search to determine the best ones. We note that the results of the EBM were very sensitive to the initial seed and that we had trouble obtaining consistent results for the models. Similar observations of IBC performance have been reported in related work.

C-BeT For the performance of C-BeT, we use the recommended parameters from Cui et al. for all tested environments. Our reported results are marginally worse than the ones reported in the original work, since the original results are not averaged over 10 seeds.

The LMP model was evaluated on the Kitchen and Block Push environments with extensive hyper-parameter sweeps to find the best-performing configuration. A detailed overview of the sweep parameters and the chosen ones is shown in Table VI. On the CALVIN environment, the proposed parameters from prior work were used. We used the improved LMP variant, called HULC, which uses a different KL-divergence weighting term and a transformer model for the Seq2Seq CVAE.

RIL For the low-level policy of kitchen and block push we use 4 layers with 512 neurons each. For the CALVIN task, we use the baseline version from and kept the hyperparameters the same for training.

Diffusion-X The baseline uses the same hyper-parameters as our transformer model, reported in Table III, to guarantee a fair comparison. Diffusion-X uses $50$ inference steps on the kitchen task combined with $10$ additional fine-tuning steps at the lowest noise level, while for the block-push environment we use $20$ inference steps and $8$ additional fine-tuning steps. Diffusion-X uses a discrete variant of the Euler sampling method with an ancestral noise scheduler, which is reported in Alg. 3. Further, it applies $X$ additional denoising steps at the lowest noise level.