Scaling Robot Learning with Semantically Imagined Experience

Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Dee M, Jodilyn Peralta, Brian Ichter, Karol Hausman, Fei Xia

cs.RO cs.AI cs.CL cs.CV cs.LG

Introduction

Though recent progress in robotic learning has shown the ability to learn a number of language-conditioned tasks, the generalization of such policies still falls far short of that of recent large-scale vision-language models. One of the fundamental reasons for this limitation is the lack of diverse data that covers not only a large variety of motor skills, but also a variety of objects and visual domains. This becomes apparent in recent trends in robot learning research: when scaled to larger, more diverse datasets, current robotic learning algorithms have shown promising signs of more robust and performant robotic systems. However, this promise comes with an arduous challenge: it is difficult to significantly scale up diverse, real-world data collected by robots, as doing so requires either engineering-heavy autonomous schemes such as scripted policies or laborious human teleoperation. To put it into perspective, it took 17 months and 13 robots to collect 130k demonstrations in . In , the authors used 7 robots and 16 months to collect 800k autonomous episodes. While some works have proposed to address this conundrum by generating simulated data to satisfy these robot data needs, they come with their own set of challenges, such as generating sufficiently diverse and accurate simulations or solving sim-to-real transfer. Can we find other ways to synthetically generate realistic, diverse data without requiring realistic simulations or data collection on real robots?

To investigate this question we look to the field of computer vision.

Traditionally, synthetic generation of additional data, whether to improve the accuracy or robustify a machine learning model, has been addressed through data augmentation techniques. These commonly include randomly perturbing the images including cropping, flipping, adding noise, augmenting colors or changing brightness. While effective in some computer vision applications, these data augmentation strategies do not suffice to provide novel robotic experiences that can result in a robot mastering a new skill or generalizing to semantically new environments . However, recent progress in high-quality text-to-image diffusion models such as DALL-E 2 , Imagen or StableDiffusion provides a new level of data augmentation capability. Such diffusion-based image-generation methods allow us to move beyond traditional data augmentation techniques, for three reasons. First, they can meaningfully augment the semantic aspects of the robotic task through a natural language interface. Second, these methods are built on internet-scale data and thus can be used zero-shot to generate photorealistic images of many objects and backgrounds. Third, they have the capability to meaningfully change only part of the image using methods such as inpainting . These capabilities allow us to generate realistic scenes by incorporating novel distractors, backgrounds, and environments while reflecting the semantics of the new task or scene – essentially distilling the vast knowledge of large generative vision models into robot experience.

As an example, given data for a task such as “move the green chip bag near the orange”, we may want to teach the robot to move chip bags of any color near many new objects that it has not interacted with, such as “move the yellow chip bag near the peach” (Fig 1). These techniques allow us to exchange the objects in real data for arbitrary relevant objects. Furthermore, they can leave the semantically relevant part of the scene untouched, e.g. the grasp of the chip bag remains, while the orange becomes a peach. This results in a novel, semantically-labelled data point to teach the model a new task. Such a technique can reasonably generate many more examples such as “move the apple near the orange on a wooden desk”, “move the plum near the orange”, or even “place the coke can in the sink”.

In this paper, we investigate how off-the-shelf image-generation methods can vastly expand robot capabilities, enabling new tasks and robust performance. We propose Robot Learning with Semantically Imagened Experience (ROSIE), a general and semantically-aware data augmentation strategy. ROSIE works by first parsing human provided novel instructions and identifying areas of the scene to alter. It then leverages inpainting to make the necessary alterations, while leaving the rest of the image untouched. This amounts to a free lunch of novel tasks, distractors, semantically meaningful backgrounds, and more, as generated by internet-scale-trained generative models. We demonstrate this approach on a large dataset of robotic data and show how a subsequently trained policy is able to perform novel, unseen tasks, and becomes more robust to distractors and backgrounds. Moreover, we show that ROSIE can also improve the robustness of success detection in robotic learning especially in out-of-distribution (OOD) scenarios.

Related Work

Scaling robot learning. Given the recent results on scaling data and models in other fields of AI such as language and vision , there are multiple approaches trying to do the same in the field of robot learning. One group of methods focuses on scaling up robotic data via simulation with the hopes that the resulting policies and methods will transfer to the real world. The other direction focuses on collecting large diverse datasets in the real world by either teleoperating robots or autonomously collecting data via reinforcement learning or scripting behaviors . In this work, we present a complementary view on scaling the robot data by making use of state-of-the-art text-conditioned image generation models to enable new robot capabilities, tasks and more robust performance.

Data augmentation and domain randomization. Domain randomization is a common technique for training machine learning models on synthetically generated data. The advantage of domain randomization is that it makes it possible to train models on a wide variety of data to improve generalization. Domain randomization usually involves changing the physical parameters or rendering parameters (lighting, texture, backgrounds) in simulation models. Others use data augmentation to transform simulated data to be more realistic or vice versa. Contrary to these methods, we propose to directly augment data collected in the real world, leveraging diffusion models to perform photorealistic image manipulation on this data.

Diffusion models for robot control. Though diffusion models have become commonplace in computer vision, their application to robotic domains is relatively nascent. Janner et al. use diffusion models to generate motion plans in robot behavior synthesis. Some works have used the ability of image diffusion models to generate images and perform common-sense geometric reasoning to propose goal images fed to object-conditioned policies. The recent concurrent works CACTI and GenAug are most similar to ours. CACTI proposes to use a diffusion model for augmenting data collected from the real world by adding new distractors, and requires manually provided masks and semantic labels. GenAug explores the usage of depth-guided diffusion models for augmenting new tasks and objects in real-world robotic data with human-specified masks and object meshes. In contrast, our work generates both novel distractors and new tasks without requiring depth. In addition, it automatically selects regions for inpainting with text guidance and leverages text-guided diffusion models to generate novel, realistic augmentations.

Preliminaries

Diffusion models and inpainting. Diffusion models are a class of generative models that have shown remarkable success in modeling complex distributions . Diffusion models work through an iterative denoising process, transforming Gaussian noise into samples of the distribution guided by a mean squared error loss. Many such models also have the capability for high-quality inpainting, essentially filling in masked areas of an image . In addition, such approaches can be guided by language, thus generating areas consistent with both a language prompt and the image as a whole .
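For reference, the mean squared error objective mentioned above is commonly written in the noise-prediction form used by standard denoising diffusion models. This is a generic sketch rather than the exact objective of any model used in this work; the symbols $x_0$ (clean image), $\epsilon_\theta$ (denoising network), $\bar{\alpha}_t$ (cumulative noise schedule) and $c$ (conditioning, e.g. a text prompt and, for inpainting, the masked image and mask) are our own notation.

```latex
% Forward (noising) process at step t
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Mean squared error training objective on the predicted noise
\mathcal{L}(\theta) =
\mathbb{E}_{x_0,\, c,\, t,\, \epsilon}
\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert_2^2 \right]
```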

Robot Learning with Semantically Imagened Experience (ROSIE)

In this section, we introduce our approach, ROSIE, an automated pipeline for scaling up robot data generation via semantic image augmentation. We assume that we have access to episodes of state and action pairs demonstrating a robot executing a task that is labelled with a natural language instruction. As the first step of the pipeline we augment the natural language instruction with a semantically different circumstance. For example, given a demonstration of placing an object in an empty drawer, we add “there is a coke can in the opened drawer”. With this natural language prompt, ROSIE generates the mask of the region of interest that is relevant to the language query. Next, given the augmentation text, ROSIE performs inpainting on the selected mask with Imagen Editor to insert semantically accurate objects that follow the augmented text instruction. Importantly, the entire process is applied throughout the robot trajectory, which is now consistently augmented across all the time steps. We present the overview of this pipeline in Fig. 2. We describe the details of each component of ROSIE in the following sections. In Section 4.1, we show how we obtain the mask of the target region using open vocabulary segmentation. In Section 4.2, we discuss two main approaches to proposing prompts used for Imagen Editor, which can be either specified manually or generated automatically with a large language model. In Section 4.3, we discuss how we perform inpainting with Imagen Editor based on the augmentation prompt. Finally, we show how we use the generated data in downstream tasks such as policy learning and learning high-level tasks such as success detection in Section 4.4.
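The per-episode flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `Episode`, `augment_episode`, `segment` and `inpaint` are hypothetical names, and the two callables stand in for the open-vocabulary segmentation model and the inpainting model respectively.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List

Frame = Any  # stand-in for an image
Mask = Any   # stand-in for a segmentation mask


@dataclass(frozen=True)
class Episode:
    instruction: str
    frames: List[Frame]
    actions: List[Any]  # actions are never modified by the augmentation


def augment_episode(
    episode: Episode,
    new_instruction: str,
    region_prompt: str,
    inpaint_prompt: str,
    segment: Callable[[Frame, str], Mask],          # hypothetical segmentation call
    inpaint: Callable[[Frame, Mask, str], Frame],   # hypothetical inpainting call
) -> Episode:
    """Apply the same text-guided edit to every frame of one episode."""
    new_frames = []
    for frame in episode.frames:
        # The mask is recomputed per frame because the robot and objects move.
        mask = segment(frame, region_prompt)
        new_frames.append(inpaint(frame, mask, inpaint_prompt))
    # Only the images and the language label change; actions are kept as-is.
    return replace(episode, instruction=new_instruction, frames=new_frames)
```

Keeping the actions untouched reflects the fact that only the appearance of the scene is augmented, not the robot's motion.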

In order to generate semantically meaningful augmentations on top of existing robotic datasets, we first need to detect the region of the image where such augmentation should be performed. To this end, we perform open-vocabulary instance segmentation leveraging the OWL-ViT open-vocabulary detector with an additional instance segmentation head. This additional head predicts fixed resolution instance masks for each bounding box detected by OWL-ViT (similar in style to Mask-RCNN ). In particular, we freeze the main OWL-ViT model and fine-tune a mask head on Open-Images-V5 instance segmentations .

4.2 Augmentation Text Proposal

Next, we discuss two main approaches to attain the augmentation prompt for the text-to-image diffusion model: hand-engineered prompt and LLM-proposed prompt.

Hand-engineered prompt. The first method involves manually specifying the object to augment. For generating new tasks, we choose objects that lie outside of our training data to ensure that the augmentations are able to expand the data support. For improving robustness of the learned policy and success detection, we randomly pick objects that are semantically meaningful and add them to the prompt to generate meaningful distractors in the scene. For example, in Figure 4, where we aim to generate novel in-hand objects by replacing the original object (green chip bag) with various microfiber cloths, we use the prompt “Robot picking up a blue and white stripe cloth” to perform inpainting effectively.

LLM-proposed prompt. While a hand-engineered prompt may guarantee that the generated data is out-of-distribution, it makes the data generation process less scalable. Therefore, we propose to leverage the power of large language models in proposing objects to augment. We leverage the rich semantics learned in LLMs to propose a vast list of objects with detailed descriptions of visual features for augmentation. We employ GPT-3 as our choice of LLM to propose the augmentation text. In particular, we specify the original task of the episode and the target task after augmentation in the LLM prompt, and ask the LLM to propose the OWL-ViT prompt for detecting masks of both the target region and the passthrough objects. We present an example of LLM-assisted augmentation prompt proposal in Figure 2, where the LLM-generated augmentation text is highly informative, which in turn benefits the text-guided image editing. Therefore, we use LLM-proposed prompts in our experiments. Although there is some noise in the LLM-proposed prompts (see Appendix C), it generally does not hurt robotic control performance in practice.

4.3 Diffusion Model for Text-Guided Inpainting

Given the segmentation mask and the augmentation prompt, we perform text-guided image editing via a text-to-image diffusion model. Here, we use Imagen Editor, a state-of-the-art text-guided image inpainting model fine-tuned from the pre-trained text-to-image generator Imagen, though we note that our approach, ROSIE, is agnostic to the choice of inpainting model. Imagen Editor is a cascaded diffusion architecture. All of its diffusion models, i.e., the base model and the super-resolution (SR) models, are conditioned on high-resolution 1024 $\times$ 1024 image and mask inputs and are trained with new convolutional image encoders, shown in the bottom right corner of Figure 2. Imagen Editor is capable of generating high-resolution photorealistic augmentations, which is crucial for robot learning as it relies on realistic images capturing physical interactions. Moreover, Imagen Editor is trained to de-noise object-oriented masks provided by off-the-shelf object detectors along with random box/stroke masks, enabling inpainting with our mask generation procedure.

4.4 Manipulation Model Training

Experiments

In our experimental evaluation, we focus on robot manipulation and embodied reasoning (e.g. detecting if a manipulation task is performed successfully). We design experiments to answer the following research questions:

RQ1: Can we leverage semantic-aware augmentation to learn completely new skills only seen through diffusion models?

RQ2: Can we leverage semantic-aware augmentation to make our policy more robust to visual distractors?

RQ3: Can we leverage semantic-aware augmentation to bootstrap high-level embodied reasoning such as success detection?

To answer these questions, we perform empirical evaluations of ROSIE using the multi-task robotic dataset collected in , which consists of $\sim$ 130k robot demonstrations with 744 language instructions collected in laboratory offices and kitchens. These tasks include skills such as picking, placing, opening and closing drawers, moving objects near target containers, manipulating objects into or out of the drawers, and rearranging objects. For more details regarding the tasks and the data used we refer to Brohan et al. .

In our experiments, we aim to understand the effects of both the augmented text and the augmented images on policy learning. We thus perform two comparisons, ablating these changes:

Pre-trained RT-1 (NoAug): we take the RT-1 policy trained on the 744 tasks in . While pre-trained RT-1 is not trained on tasks with the augmentation text and generated objects, it has been shown to enjoy promising pre-training capability and demonstrate excellent zero-shot generalization to unseen scenarios and therefore, should have the ability to tackle the novel tasks to some extent.

Fine-tuned RT-1 with Instruction Augmentation (InstructionAug): Similar to Xiao et al., we relabel the original episodes in the RT-1 dataset with new instructions generated via our augmentation text proposal (Section 4.2) while keeping the images unchanged. We expect this method to bring the text instructions in-distribution but fail to recognize the visuals of the augmented objects.

For implementation details and hyperparameters, please see Appendix A.

To answer RQ1, we augment the RT-1 dataset via generating new objects that the robot needs to manipulate. We evaluate our method and the baselines in the following four categories with increasing level of difficulty.

First, we test the tasks of moving training objects near unseen containers. We visualize such unseen containers in Figure 10 in Appendix B. We select the tasks “move {some object} near white bowl” and “move {some object} near paper bowl” within the RT-1 dataset, which yields 254 episodes in total. We use the augmentation text proposals to replace the white bowl and the paper bowl with the following list of objects {lunch box, woven basket, ceramic pot, glass mason jar, orange paper plate}, which are visualized in Figure 10. For each augmentation, we augment the same number of episodes as the original task.

As shown in Table 1, our ROSIE fine-tuned RT-1 policy (trained on both the whole RT-1 training set of 130k episodes and the generated novel tasks) outperforms pre-trained RT-1 policy and fine-tuned RT-1 with instruction augmentations, suggesting that ROSIE is able to generate fully unseen tasks that are beneficial for control and exceeds the inherent transfer ability of RT-1.

Second, we perform a similar experiment, where we focus on placing objects into the novel target containers, rather than just nearby. Example augmentations are shown in Figure 10. Table 1 again shows ROSIE outperforms both pre-trained RT-1 and RT-1 with instruction augmentation by at least 75%.

Third, we test the limits of ROSIE on novel tasks where the object to be manipulated is generated via ROSIE. We pick the set of tasks “pick green chip bag” from the RT-1 dataset consisting of 1309 episodes. To accurately generate the mask of the chip bag throughout the trajectory, we run our open-vocabulary segmentation to detect the chip bag and the robot gripper as the passthrough objects so that we can filter out the robot gripper to obtain the accurate mask of the chip bag when it is grasped. We further query Imagen Editor to substitute the chip bag with a fully unknown microfiber cloth with distinctive colors (black and blue), with augmentations shown in Figure 4. Table 1 again demonstrates that ROSIE outperforms pre-trained RT-1 and RT-1 with instruction augmentation by at least 150%, proving that ROSIE is able to expand the manipulation task family via diversifying the manipulation targets and boost the policy performance in the real world.

Finally, to further stress-test our diffusion-based augmentation pipeline, we try to learn to place objects into a sink. Note that the robot has never collected data for this task in the real world. We generate a challenging scenario where we take all the RT-1 tasks that place a can into the top drawer of a counter (779 episodes in total) and deploy ROSIE to detect the open drawer and replace it with a metal sink using Imagen Editor (see the first row of Figure 7 for the visualization). Similar to the above two experiments, we dynamically compute the mask of the open drawer at each frame of the episode while removing the robot arm and the can in the robot hand from the mask. Note that the generated sink makes the scene completely out of the training distribution, which poses considerable difficulty to the pre-trained RT-1 policy. The results in the last row of Table 1 confirm this. ROSIE achieves a 60% overall success rate in placing the coke can and the pepsi can into the sink, whereas the RT-1 policy is not able to locate the can and fails to achieve any success. In Figure 7, we include visualizations of a trajectory of the original episode with augmentations that replace the drawer with the sink, and of a trajectory of the policy rollout performing the task near a real metal sink. Our method effectively learns from the episodes with the sink generated by ROSIE and completes the tasks that involve the sink in the real kitchen.

Overall, through these experiments, ROSIE is shown to be capable of effectively inpainting both the objects that require rich manipulation and the target object of the manipulation policy, significantly augmenting the number of tasks in robotic manipulation. These results indicate a promising path to scaling robot learning without extra effort of real data collection.

5.2 RQ2: Robustifying manipulation policies

We investigate RQ2 with two scenarios: policy robustness w.r.t. different backgrounds and new distractors.

We employ ROSIE to augment the background in our training data. We perform two types of augmentations: replacing the table top with a colorful table cloth and inserting a sink on the table top. We select two manipulation tasks, “pick coke can” and “pick pepsi can”, from our training set, which consist of 1222 episodes in total. We run open-vocabulary segmentation to detect the table and the passthrough objects, which consist of the robot arm and the target can. To generate a diverse set of table cloths during augmentation, we query GPT-3 with the following prompt:

inpainting prompt: pick coke can from a red and yellow table cloth
goal: list 30 more table cloth with different vivid colors and styles with visual details
inpainting prompt: pick coke can from
1. Navy blue and white striped table cloth
2. White and pink polka dot table cloth
3. Mint green and light blue checkered table cloth
4. Cream and gray floral table cloth
5. Hot pink and red floral table cloth
...

The numbered items are example answers from GPT-3, which are semantically meaningful. We use Imagen Editor to replace the table top, except the target can, with the LLM-proposed table cloth. To inpaint a sink on the table, we follow the same procedure described in the placing objects into unseen sink task in Section 5.1, except that we inpaint the sink on the table top rather than the open drawer. We present visualizations of such augmentations in Figure 6.

We fine-tune the pre-trained RT-1 policy on both the original data and the augmented episodes with generated table cloths and the metal sink. As shown in Table 1, ROSIE + RT-1 significantly outperforms RT-1 NoAug in 7 out of 8 settings while performing similarly to NoAug in the remaining scenario, achieving an overall 115% improvement. Therefore, ROSIE is highly effective in robustifying policy performance under varying table textures and backgrounds.
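Turning the numbered LLM answer into one inpainting prompt per table cloth can be sketched as below. This is an illustrative helper only: the function name `build_inpainting_prompts` and the assumption that the answer is a list numbered `1.`, `2.`, ... are ours, not part of the described system.

```python
import re
from typing import List


def build_inpainting_prompts(task_prefix: str, llm_answer: str) -> List[str]:
    """Split a numbered LLM answer into items and prepend the task prefix."""
    # Split on markers such as "1. " or "12. ", whether they are separated
    # by spaces or by newlines.
    items = re.split(r"\s*\d+\.\s+", llm_answer)
    return [f"{task_prefix} {item.strip()}" for item in items if item.strip()]


answer = (
    "1. Navy blue and white striped table cloth "
    "2. White and pink polka dot table cloth\n"
    "3. Mint green and light blue checkered table cloth"
)
prompts = build_inpainting_prompts("pick coke can from", answer)
```

Each resulting string can then be paired with the table mask of an episode to produce one augmented copy of that episode.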

To test whether ROSIE can improve policy robustness w.r.t. novel distractors and cluttered scenes, we consider the following two tasks. First, we train a policy solely from the task “pick coke can” and investigate its ability to perform this task with distractor coke cans, which have not been seen in the 615 training episodes. To this end, we employ ROSIE to add an equal number of augmented episodes with additional coke cans on the table (see Figure 8 in Appendix B for visualizations). As shown in Table 1, RT-1 + ROSIE augmentations improves the performance over RT-1 trained with “pick coke can” data only in scenarios where there are multiple coke cans on the table.

Second, we evaluate a task that places a chip bag into a drawer and investigate the policy's ability to perform this task with distractor objects already in the drawer, also unseen during training. This scenario is challenging for RT-1, since the distractor object in the drawer will confuse the model and make it more likely to directly output a termination action. We use ROSIE to add novel objects to the drawer, as shown in Figure 9 in Appendix B, and follow the same training procedure as in the coke can experiment. Table 1 shows that RT-1 trained with both the original data and ROSIE-generated data outperforms RT-1 trained with only the original data. Our interpretation is that RT-1 trained only on the original data has never seen this situation and incorrectly believes that the task is already solved at the first frame, whereas ROSIE can mitigate this issue by expanding the dataset using generative models.

5.3 RQ3: A Case Study on Success Detection

In this section, we show that ROSIE is also effective in improving high-level robotic embodied reasoning tasks, such as success detection. Success detection (or failure detection) is an important capability for autonomous robots for accomplishing tasks in dynamic situations that may require adaptive feedback from the environment. Given large diversity of potential situations that a robot might encounter, a general solution to this problem may involve deploying learned failure detection systems that can improve with more data. As recent work has shown, visual-language models (VLMs) such as CLIP with internet scale pre-training can be fine-tuned on domain specific robotic experience to perform embodied reasoning such as success detection. However, collecting domain specific fine-tuning data is often expensive, and it is difficult to scale data collection to cover all potential success and failure cases. This challenge is similar to the one of learning a robust policy that we presented in the previous sections, where the dataset of robot data might include data distribution biases that are difficult to correct with on-robot data collection alone.

As a motivating example, consider the experimental setting from Section 5.1 where a large dataset of teleoperated demonstrations was collected for placing various household objects into empty cabinet drawers. A success detector trained on this dataset would require additional priors and/or data to generalize to images of cluttered drawers.

To study this setting, we utilize ROSIE to augment 22764 episodes of placing objects into drawers from the dataset used in and then fine-tune a CLIP-based success detector following the procedure in . Starting from the episodes of robotic placing into empty drawers, we create two augmented datasets with ROSIE to emulate visual clutter: one dataset (A) that includes generated distractor chip bags inside the drawer and one dataset (B) that includes generated soda cans inside the drawer. Both datasets have the same number of episodes as the original dataset. We evaluate the fine-tuned CLIP-based success detector with and without ROSIE-augmented episodes on two datasets: the in-distribution set and the OOD set. Our in-distribution set contains 76 episodes of the robot putting a green rice chip bag into the drawer and taking it out of the drawer, while the OOD set contains 58 episodes of the robot putting a (green rice, green jalapeno, blue, brown) chip bag into the drawer, where the drawer contains other items that are not observed in the training set. Note that this OOD set makes success detection particularly challenging, as the model can easily be misguided by the cluttered distractors in the drawer and make incorrect predictions even if the robot fails to place the target object into the drawer.

By utilizing increasing amounts of augmentation from ROSIE, we find that learned success detectors become increasingly robust at detecting successes and failures in difficult, cluttered, real-world OOD drawer scenarios in terms of F1 score, as seen in Table 2. Note that our OOD dataset is highly challenging, as discussed above, so the prior work without augmentations struggles considerably in this setting, whereas ROSIE obtains reasonable performance. Furthermore, we find that the accuracy on the standard, in-distribution tasks remains unchanged. This indicates that ROSIE can be used as a general semantically-consistent data augmentation technique across various tasks such as policy learning and embodied reasoning.
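For readers less familiar with the metric, the F1 score of a binary success detector is the harmonic mean of precision and recall over the success class. The sketch below uses the standard definition; the function name and the convention of returning 0.0 when there are no true positives are our own choices, not details from the evaluation described here.

```python
from typing import Sequence


def f1_score(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """F1 of the positive ("success") class for binary predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        # Convention: with no true positives, precision or recall is zero
        # (or undefined), so we report 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A detector that is fooled by clutter and predicts success for failed placements accumulates false positives, which lowers precision and therefore F1.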

Societal Impact

The model used in this work is a text-guided image generation model, which opens many new possibilities for content creation and, consequently, many risks. Our approach attempts to minimize many of these risks through a controlled usage of these technologies, by only modifying local patches of images and using narrowly scoped semantic labels. We further follow accepted responsible AI practices, such as regularly inspecting data before training on it, and in general recommend that researchers establish robust inspection and filtering mechanisms when utilizing text-guided image generation models for data augmentation.

Discussion, Future Work, and Conclusion

In summary, we presented ROSIE, a system that uses off-the-shelf text-guided image generation models to vastly expand robotics datasets without any additional real-world data collection. To accomplish this, we generated new instructions and their corresponding text prompts for altering the images, enabling robots to achieve tasks that were only seen through the lens of the image generation process. We were also able to generate semantically meaningful augmentations of the images, enabling various learned models trained on the data to be more robust with respect to OOD scenes. Lastly, we experimentally validated the proposed method on a variety of language-conditioned manipulation tasks.

Though the method is general and flexible, there are a few limitations of this work that we aim to address in the future. First, we only augment the appearance of the objects and scenes, and do not generate new motions. To alleviate this limitation of not augmenting physics and motions, we could consider mixing in simulation data as a potential source of diverse motion data. Another limitation of the proposed method is that it performs image augmentation per frame, which can lead to a loss in temporal consistency. However, we find that at least for the architecture that we use (Robotics Transformer), we do not suffer from a performance drop. State-of-the-art text-to-video diffusion models can generate temporally consistent videos but might lose photorealism and physics realism. We speculate that this can cause downstream task learning performance to deteriorate. The trade-off between photorealism and temporal consistency remains an interesting topic for future studies. Finally, we use a diffusion model for image augmentation, which is computationally heavy and limits our capability to perform on-the-fly augmentation. As a future direction, we could consider other models such as the mask transformer-based architecture, which is 10x more efficient.

Acknowledgments

We would like to acknowledge Sharath Maddineni, Brianna Zitkovich, Vincent Vanhoucke, Kanishka Rao, Quan Vuong, Alex Irpan, Sarah Laszlo, Bob Wei, Sean Kirmani, Pierre Sermanet and the greater teams at Robotics at Google for their feedback and contributions.

References

Appendices

We take a pre-trained RT-1 policy with 35M parameters that was trained for 315k steps at a learning rate of $1\times 10^{-4}$, and fine-tune it with a 1:1 mixing ratio of the original 130k episodes of RT-1 data and the ROSIE-generated episodes for 85k steps with a learning rate of $1\times 10^{-6}$. We follow all the other policy training hyperparameters used in .
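One way to realize a 1:1 mixing ratio is to draw each training example from the original or the generated pool with equal probability, regardless of pool size. The sketch below assumes this interpretation; the function name `mixed_episode_stream` is hypothetical and the actual data pipeline may implement mixing differently.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def mixed_episode_stream(
    original: Sequence[T], augmented: Sequence[T], seed: int = 0
) -> Iterator[T]:
    """Yield episodes forever, picking each source with probability 0.5."""
    rng = random.Random(seed)
    while True:
        # Assumption: "1:1" means equal sampling weight per source, so the
        # much smaller augmented pool is effectively upsampled.
        source = original if rng.random() < 0.5 else augmented
        yield rng.choice(source)
```

Under this reading, the generated episodes contribute roughly half of every batch even though they are far fewer than the 130k original episodes.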

To obtain an accurate segmentation mask of the target region of augmentation, we set a threshold for filtering out predicted masks with low prediction scores, for both the region of interest and the passthrough objects given by OWL-ViT. In cases where we have multiple detected masks, we always select the one with the highest prediction score. Specifically, for experiments where the robot is required to pick novel objects, place objects into novel containers, or move objects near unseen containers (Section 5.1), we use a threshold of 0.07 to detect the in-hand objects and the containers while using a threshold of 0.05 to detect passthrough objects, which are the robot arm and robot gripper. In experiments where the robot is instructed to place the coke can or the pepsi can into the unknown sink, or to pick up the coke can and the pepsi can with a new background, we use a threshold of 0.04 to detect the table with all objects and a threshold of 0.03 to detect the passthrough objects, which are the robot arm, robot gripper and the coke can or the blue can in this case. In experiments discussed in Sections 5.2 and 5.3, we use a threshold of 0.3 to detect the table or the open drawer where we want to add new distractors.
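The selection rule above (drop low-score masks, keep the highest-scoring one, then carve the passthrough objects out of the region to edit) can be sketched as follows. This is a simplified illustration: masks are represented as sets of pixel coordinates for brevity, whereas a real implementation would typically use boolean arrays, and `Detection`, `best_detection` and `inpainting_mask` are hypothetical names. Whether the threshold comparison is strict is also an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

Pixel = Tuple[int, int]  # (row, column)


@dataclass(frozen=True)
class Detection:
    score: float
    mask: FrozenSet[Pixel]


def best_detection(
    detections: Iterable[Detection], threshold: float
) -> Optional[Detection]:
    """Discard detections below the threshold and keep the top-scoring one."""
    kept = [d for d in detections if d.score >= threshold]
    return max(kept, key=lambda d: d.score) if kept else None


def inpainting_mask(
    region_detections: Iterable[Detection],
    passthrough_detections: Iterable[Iterable[Detection]],  # one list per object
    region_threshold: float,
    passthrough_threshold: float,
) -> FrozenSet[Pixel]:
    """Region to edit, with passthrough objects (e.g. arm, gripper) removed."""
    region = best_detection(region_detections, region_threshold)
    if region is None:
        return frozenset()  # nothing confident enough to edit in this frame
    protected = set()
    for detections in passthrough_detections:
        best = best_detection(detections, passthrough_threshold)
        if best is not None:
            protected |= best.mask
    return region.mask - protected
```

For the novel-object experiments, the thresholds quoted above would correspond to `region_threshold=0.07` and `passthrough_threshold=0.05`.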

To generate LLM-assisted prompts, we use 1-shot prompting of the LLM. For example, to generate novel distractors in the task where we place objects into the drawer (Section 5.2), we give the LLM the following prompt:

Source task: place pepsi can on the counter
Target task: place pepsi can on the clutter counter
ViT region prompt: empty counter
passthrough object prompt: robot arm, robot gripper
inpainting prompt: add a chip bag on the counter
Source task: place coke can into top drawer
Target task: place coke can into cluttered top drawer

and the LLM generates the following prompts for mask detection and augmentation (light blue means LLM generated):

ViT region prompt: empty drawer
passthrough object prompt: robot arm, robot gripper
inpainting prompt: add a box of crackers in the drawer

which is semantically meaningful for performing mask detection and Imagen Editor augmentation. We follow this recipe of prompting for all of the tasks in our experiments.
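As a rough sketch of this prompting recipe, the 1-shot prompt can be assembled from one worked example plus the new source and target tasks, and the LLM's completion can be split back into its three fields. The function names `build_one_shot_prompt` and `parse_llm_response`, the dictionary keys, and the one-field-per-line layout are assumptions made for illustration; no LLM call is made here.

```python
def build_one_shot_prompt(example: dict, source_task: str, target_task: str) -> str:
    """Format one worked example followed by the new task pair."""
    lines = [
        f"Source task: {example['source_task']}",
        f"Target task: {example['target_task']}",
        f"ViT region prompt: {example['region_prompt']}",
        f"passthrough object prompt: {example['passthrough_prompt']}",
        f"inpainting prompt: {example['inpainting_prompt']}",
        f"Source task: {source_task}",
        f"Target task: {target_task}",
    ]
    return "\n".join(lines)

def parse_llm_response(text: str) -> dict:
    """Split 'field: value' lines of the completion into a dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that do not contain a field separator
            fields[key.strip()] = value.strip()
    return fields

example = {
    "source_task": "place pepsi can on the counter",
    "target_task": "place pepsi can on the clutter counter",
    "region_prompt": "empty counter",
    "passthrough_prompt": "robot arm, robot gripper",
    "inpainting_prompt": "add a chip bag on the counter",
}
prompt = build_one_shot_prompt(
    example,
    "place coke can into top drawer",
    "place coke can into cluttered top drawer",
)
# The completion quoted above, used here as a fixed string.
response = (
    "ViT region prompt: empty drawer\n"
    "passthrough object prompt: robot arm, robot gripper\n"
    "inpainting prompt: add a box of crackers in the drawer"
)
fields = parse_llm_response(response)
```

The parsed `ViT region prompt` and `passthrough object prompt` fields would then feed mask detection, and the `inpainting prompt` field would feed the image editor.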

During inpainting, we take the checkpoint of Imagen Editor 64x64 base model and the 256x256 super-resolution model trained in and directly run inference to produce augmentations.

During evaluation, for the tasks that involve moving objects near novel containers and grasping the unseen microfiber cloth, we perform 10 policy rollouts per new container or microfiber cloth for each method. For tasks that involve placing objects into novel containers, we perform 8 policy rollouts per new container for each method. For the task where the robot is instructed to place the coke can or the pepsi can into the unseen kitchen sink, we perform 5 policy rollouts each for the coke can and the pepsi can, for each method. For the task where the robot is instructed to grasp the coke can and the pepsi can in new backgrounds, we evaluate each method with 10 rollouts. For the task where the robot places the object into the cluttered drawer, we perform 10 policy rollouts per object for each method. Finally, for the task that requires the robot to pick up a coke can in a scene with multiple coke cans, we perform 27 policy rollouts for each approach.

A.2 Computation Complexity

We train our policy on 16 TPUs for 1 day. To obtain segmentation masks, we run OWL-ViT inference on 1 TPU for 1 hour to generate 1k episodes. During augmentation, we run inference of the Imagen Editor 64 x 64 base model and the 256 x 256 super-resolution model on 4 TPUs each for 2 hours to generate 1k episodes.

B Examples of Augmentations

We include more visualizations of augmentations generated by ROSIE in this section. In Figure 10, we show the generated episodes of ROSIE where we inpaint novel containers in the scene, which are used in the "Learning to move objects near generated novel containers" and "Learning to place objects into generated unseen containers" experiments in Section 5.1.

In Figure 8 and Figure 9, we visualize augmented episodes with new distractors, e.g. cluttered coke cans on the table and chip bags in the empty open drawer. These augmentations correspond to the experiments conducted in Section 5.2.

We also visualize the attention layers in RT-1 when training on our augmented data. As seen in Fig. 11, there are attention heads focusing on our augmented objects, which suggests that the augmentation is effective.

Overall, note that ROSIE is able to generate semantically realistic novel objects and distractors in the manipulation setting. For example, ROSIE-generated objects typically have realistic shadows on the table or the drawer, which is beneficial for training manipulation policies on top of such data.

C Failure Cases of Generated Prompts and Images

While our LLM-assisted prompts generally work very well, we note that they require few-shot prompting to do so. In the zero-shot case, the LLM tends to hallucinate and output unhelpful augmentation prompts. For example, if we provide the following zero-shot prompt:

Source task: pick coke can on a table
Target task: pick coke can near a sink
Goal: replace the scene in the source task with the scene in the target task
inpainting prompt:

the LLM gives the following response:

Pick up the coke can near the sink, replacing the one originally on the table

which is not correct. Therefore, few-shot prompting is crucial in ROSIE.

We show failure cases of the augmented images in Figure 12. For the two examples on the left, ROSIE is supposed to generate a woven basket and a glass mason jar respectively, but it fails to generate such containers and instead generates bowl-shaped containers. For the two examples on the right, ROSIE is supposed to replace the in-hand green chip bag with a blue microfiber cloth and a yellow rubber duck respectively. However, as the mask of the in-hand object becomes irregular, the performance of ROSIE degrades: it is unable to generate the blue microfiber cloth and the yellow rubber duck in full shape, and half of the in-hand object remains a green chip bag. We suspect that by fine-tuning Imagen Editor on robotic datasets containing more manipulation-related data, we can improve the generation results drastically. Note that while the generation can be suboptimal at times, our insight is that such imperfect generation can only lead to misalignment between the task instruction and the images, which may not have a large negative impact on policy performance and could provide extra data augmentation benefit for free. Our policy performance in Section 5 validates this insight to some degree.