PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

Introduction

Large language models (LLMs) demonstrate strong reasoning capabilities across various domains, including dialogue Glaese et al. (2022); Thoppilan et al. (2022), step-by-step reasoning Wei et al. (2022); Kojima et al. (2022), math problem solving Lewkowycz et al. (2022); Polu et al. (2022), and code writing Chen et al. (2021a). However, a limitation of such models for inference in the real world is the issue of grounding: while training LLMs on massive textual data may lead to representations that relate to our physical world, connecting those representations to real-world visual and physical sensor modalities is essential to solving a wider range of grounded real-world problems in computer vision and robotics Tellex et al. (2020). Previous work (Ahn et al., 2022) interfaces the output of LLMs with learned robotic policies and affordance functions to make decisions, but is limited in that the LLM itself is only provided with textual input, which is insufficient for many tasks where the geometric configuration of the scene is important. Further, in our experiments we show that current state-of-the-art visual-language models trained on typical vision-language tasks such as visual-question-answering (VQA) cannot directly solve robotic reasoning tasks.

In this paper we propose embodied language models, which directly incorporate continuous inputs from sensor modalities of an embodied agent and thereby enable the language model itself to make more grounded inferences for sequential decision making in the real world. Inputs such as images and state estimates are embedded into the same latent embedding space as language tokens and processed by the self-attention layers of a Transformer-based LLM in the same way as text. We start from a pre-trained LLM into which we inject the continuous inputs through encoders. These encoders are trained end-to-end to output sequential decisions in terms of natural text that can be interpreted by the embodied agent by conditioning low-level policies, or to give an answer to an embodied question. We evaluate the approach in a variety of settings, comparing different input representations (e.g. standard vs. object-centric ViT encodings for visual input), freezing vs. finetuning the language model while training the encoders, and investigating whether co-training on multiple tasks enables transfer.

To investigate the approach’s breadth, we evaluate on three robotic manipulation domains (two of which are closed-loop in the real-world), standard visual-language tasks such as VQA and image captioning, as well as language tasks. Our results indicate that multi-task training improves performance compared to training models on individual tasks. We show that this transfer across tasks can lead to high data-efficiency for robotics tasks, e.g. significantly increasing learning success from handfuls of training examples, and even demonstrating one-shot or zero-shot generalization to novel combinations of objects or unseen objects.

We scale PaLM-E up to 562B parameters, integrating the 540B PaLM Chowdhery et al. (2022) LLM and the 22B Vision Transformer (ViT) Dehghani et al. (2023) into, to our knowledge, the largest vision-language model currently reported. PaLM-E-562B achieves state-of-the-art performance on the OK-VQA Marino et al. (2019) benchmark, without relying on task-specific finetuning. Although not the focus of our experimentation, we also find (Fig. 2) that PaLM-E-562B exhibits a wide array of capabilities including zero-shot multimodal chain-of-thought (CoT) reasoning, few-shot prompting, OCR-free math reasoning, and multi-image reasoning, despite being trained on only single-image examples. Zero-shot CoT Kojima et al. (2022), originally a language-only concept, has been shown on multimodal data with task-specific programs Zeng et al. (2022) but to our knowledge, not via an end-to-end model.

To summarize our main contributions, we (1) propose and demonstrate that a generalist, transfer-learned, multi-embodiment decision-making agent can be trained by mixing embodied data into the training of a multimodal large language model. We show that, (2) while current state-of-the-art general-purpose visual-language models do not address embodied reasoning problems well out-of-the-box (zero-shot), it is possible to train a competent general-purpose visual-language model that is also an efficient embodied reasoner. In studying how to best train such models, we (3) introduce novel architectural ideas such as neural scene representations and entity-labeling multimodal tokens. Finally, in addition to our focus on PaLM-E as an embodied reasoner we (4) show that PaLM-E is also a quantitatively competent vision and language generalist, and (5) demonstrate that scaling the language model size enables multimodal finetuning with less catastrophic forgetting.

Related Work

General vision-language modeling. Building on successes in large language Brown et al. (2020); Devlin et al. (2018) and vision Dosovitskiy et al. (2020) models, recent years have seen a growing interest in large vision-language models (VLMs) Li et al. (2019); Lu et al. (2019); Hao et al. (2022); Gan et al. (2022). Unlike their predecessors, VLMs are capable of simultaneously understanding both images and text, and can be applied to tasks such as visual question answering Zhou et al. (2020); Zellers et al. (2021b), captioning Hu et al. (2022), optical character recognition Li et al. (2021), and object detection Chen et al. (2021b). The methods by which images are integrated vary. For example, Alayrac et al. (2022) augments pretrained language models with a mechanism to directly attend to a single context image. In contrast, PaLM-E represents images and text as “multimodal sentences” of latent vectors, allowing it to process multiple images in a flexible way within any part of a sentence. More closely related to our work is Frozen (Tsimpoukelli et al., 2021) where vision encoder parameters are optimized via backpropagation through a frozen LLM (Lu et al., 2021). Inspired by this work, we investigate the design in a broader scope by introducing alternative input modalities (e.g. neural scene representations), and our proposed approach empirically outperforms Frozen by more than 45% on the VQAv2 benchmark. More importantly we demonstrate that PaLM-E is applicable not only to perceptual but also embodied tasks.

Actions-output models. Prior works focus on combining vision and language inputs in an embodied setting with the goal of direct action prediction Guhur et al. (2022); Shridhar et al. (2022b; a); Zhang & Chai (2021); Silva et al. (2021); Jang et al. (2022); Nair et al. (2022); Lynch et al. (2022); Brohan et al. (2022). Among these methods, VIMA (Jiang et al., 2022) explores multimodal prompts similar to PaLM-E. The role of language is perhaps most aptly described as task specification in these works. In contrast, PaLM-E generates high-level instructions as text; in doing so, the model is able to naturally condition upon its own predictions and directly leverage the world knowledge embedded in its parameters. This enables not only embodied reasoning but also question answering, as demonstrated in our experiments. Among works that output actions, perhaps most similar is the approach proposed in Gato Reed et al. (2022) which, like PaLM-E, is a generalist multi-embodiment agent. In contrast to Gato, we demonstrate positive transfer across different tasks where the model benefits from diverse joint training across multiple domains.

LLMs in embodied task planning. There have been several methods proposed to leverage LLMs in embodied domains. While many works focus on understanding natural language goals Lynch & Sermanet (2020); Shridhar et al. (2022a); Nair et al. (2022); Lynch et al. (2022), fewer consider natural language as a representation for planning – the focus of this work. LLMs contain vast amounts of internalized knowledge about the world Bommasani et al. (2021), but without grounding, generated plans may be impossible to execute. One line of research has employed prompting to elicit a sequence of instructions directly from an LLM either by leveraging semantic similarity between an LLM’s generation and an eligible set of instructions Huang et al. (2022b), incorporating affordance functions Ahn et al. (2022), visual feedback Huang et al. (2022c), generating world models Nottingham et al. (2023); Zellers et al. (2021a), planning over graphs and maps Shah et al. (2022); Huang et al. (2022a), visual explanations Wang et al. (2023), program generation Liang et al. (2022); Singh et al. (2022), or injecting information into the prompt Zeng et al. (2022). In contrast, PaLM-E is trained to generate plans directly without relying on auxiliary models for grounding. This in turn enables direct integration of the rich semantic knowledge stored in pretrained LLMs into the planning process.

With few exceptions, the parameters of the LLMs employed in many of these works are used as-is without further training. In LID Li et al. (2022), this constraint is relaxed and LLM parameters are finetuned to produce a planning network for generating high-level instructions. $\text{(SL)}^{3}$ Sharma et al. (2021) tackles the more challenging task of simultaneously finetuning two LLMs: a planning network, which produces high-level instructions, and a low-level policy network, which selects actions. With PaLM-E, our interests are distinct and complementary: we investigate a generalist, multi-embodiment model, across multiple modalities.

PaLM-E: An Embodied Multimodal Language Model

The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model. This is realized by encoding the continuous observations into a sequence of vectors with the same dimension as the embedding space of the language tokens. The continuous information is hence injected into the language model in an analogous way to language tokens. PaLM-E is a decoder-only LLM that generates textual completions autoregressively given a prefix or prompt. We call our model PaLM-E, since we use PaLM Chowdhery et al. (2022) as the pre-trained language model, and make it Embodied.

The inputs to PaLM-E consist of text and (multiple) continuous observations. The multimodal tokens corresponding to these observations are interleaved with the text to form multi-modal sentences. An example of such a multi-modal sentence is “Q: What happened between <img_1> and <img_2>?” where <img_i> represents an embedding of an image. The output of PaLM-E is text generated auto-regressively by the model, which could be an answer to a question, or a sequence of decisions produced by PaLM-E in textual form that should be executed by a robot. When PaLM-E is tasked with producing decisions or plans, we assume that there exists a low-level policy or planner that can translate these decisions into low-level actions. Prior work has discussed a variety of ways to train such low-level policies (Lynch & Sermanet, 2020; Brohan et al., 2022), and we use these prior methods directly without modification. In the following, we describe our approach more formally.

Decoder-only LLMs. Decoder-only large language models (LLMs) are generative models trained to predict the probability $p(w_{1:L})$ of a piece of text $w_{1:L}=(w_{1},\ldots,w_{L})$ that is represented as a sequence of tokens $w_{i}\in\mathcal{W}$. Typical neural architectures realize this by factorizing into

$$p(w_{1:L})=\prod_{l=1}^{L}p_{\text{LM}}(w_{l}\mid w_{1:l-1}),$$

where $p_{\text{LM}}$ is a large transformer network.
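As a minimal sketch of this chain-rule factorization (not the PaLM implementation), the snippet below scores a token sequence by summing per-token log-probabilities. The toy `p_lm` function and its four-word vocabulary are hypothetical stand-ins for the large transformer network.

```python
import math

VOCAB = ["<bos>", "push", "the", "block"]  # hypothetical toy vocabulary


def p_lm(token, history):
    """Toy stand-in for p_LM(w_l | w_{1:l-1}): tokens not yet seen get more mass."""
    weights = {w: (1.0 if w in history else 3.0) for w in VOCAB}
    return weights[token] / sum(weights.values())


def sequence_log_prob(tokens):
    """log p(w_{1:L}) = sum over l of log p_LM(w_l | w_{1:l-1})."""
    return sum(math.log(p_lm(tokens[l], tokens[:l])) for l in range(len(tokens)))


log_p = sequence_log_prob(["push", "the", "block"])
```

Working in log space avoids numerical underflow when the product runs over many tokens.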

Prefix-decoder-only LLMs. Since the LLM is auto-regressive, a pre-trained model can be conditioned on a prefix $w_{1:n}$ without the necessity to change the architecture:

$$p(w_{n+1:L}\mid w_{1:n})=\prod_{l=n+1}^{L}p_{\text{LM}}(w_{l}\mid w_{1:l-1}).$$

The prefix or prompt $w_{1:n}$ provides the context based on which the LLM continues to predict the subsequent tokens $w_{n+1:L}$. This is often used for inference to steer the predictions of the model. For example, the prompt can contain a description of the task the LLM should solve or examples of desired text completions for similar tasks.
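To illustrate prefix conditioning, the sketch below continues a given prefix greedily with the same next-token function that would score any other sequence, so nothing about the model changes. The lookup table standing in for the LLM, and greedy decoding itself, are assumptions made for illustration only.

```python
def next_token_distribution(history):
    """Hypothetical stand-in for p_LM(. | history), keyed on the last token only."""
    table = {
        "Q:": {"pick": 0.7, "done": 0.3},
        "pick": {"block": 0.9, "done": 0.1},
        "block": {"done": 1.0},
    }
    return table.get(history[-1], {"done": 1.0})


def continue_from_prefix(prefix, max_new_tokens=5, stop="done"):
    """Greedily predict w_{n+1:L} given the prefix w_{1:n}."""
    tokens = list(prefix)  # copy so the caller's prefix is not modified
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = max(dist, key=dist.get)  # most likely next token
        tokens.append(token)
        if token == stop:
            break
    return tokens[len(prefix):]


completion = continue_from_prefix(["Q:"])
```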

Multi-modal sentences: injection of continuous observations. Multi-modal information such as image observations can be injected into the LLM by skipping the discrete token level and directly mapping the continuous observations into the language embedding space $\mathcal{X}$. To this end, we train an encoder $\phi:\mathcal{O}\rightarrow\mathcal{X}^{q}$ that maps a (continuous) observation space $\mathcal{O}$ (refer to Sec. 4 for details) into a sequence of $q$-many vectors in $\mathcal{X}$. These vectors are then interleaved with normal embedded text tokens to form the prefix for the LLM. This means that each vector $x_{i}$ in the prefix is formed from either the word token embedder $\gamma:\mathcal{W}\rightarrow\mathcal{X}$ or an encoder $\phi_{i}$:

$$x_{i}=\begin{cases}\gamma(w_{i}) & \text{if } i \text{ is a text token, or}\\ \phi_{j}(O_{j})_{i} & \text{if } i \text{ corresponds to observation } O_{j}.\end{cases}$$

Note that a single observation $O_{j}$ is usually encoded into multiple embedding vectors. It is possible to interleave different encoders $\phi_{i}$ at different locations in the prefix to combine, e.g., information from different observation spaces. Injecting the continuous information this way into the LLM reuses its existing positional encodings. In contrast to other VLM approaches (e.g., Chen et al. (2022)), the observation embeddings are not inserted at fixed positions, but instead placed dynamically within the surrounding text.
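The interleaving can be sketched as follows, assuming placeholder strings such as `<img_1>` mark where an observation goes. The tiny embedding dimension, the toy embedder `embed_token`, and the toy encoder `encode_observation` are hypothetical and only show how text vectors and observation vectors end up in one prefix sequence.

```python
EMBED_DIM = 4  # hypothetical; real embedding dimensions are far larger


def embed_token(token):
    """Toy word token embedder (gamma): one deterministic vector per text token."""
    seed = sum(ord(c) for c in token)
    return [((seed * (k + 1)) % 7) / 7.0 for k in range(EMBED_DIM)]


def encode_observation(observation, q=2):
    """Toy encoder (phi): maps one observation to q vectors of size EMBED_DIM."""
    mean = sum(observation) / len(observation)
    return [[mean + k] * EMBED_DIM for k in range(q)]


def build_prefix(tokens, observations):
    """Replace each placeholder by its q observation vectors; embed all other tokens."""
    prefix = []
    for token in tokens:
        if token in observations:
            prefix.extend(encode_observation(observations[token]))
        else:
            prefix.append(embed_token(token))
    return prefix


tokens = ["Q:", "What", "happened", "between", "<img_1>", "and", "<img_2>", "?"]
observations = {"<img_1>": [0.0, 1.0], "<img_2>": [2.0, 4.0]}
prefix = build_prefix(tokens, observations)
```

Because each observation expands to `q` vectors at the position of its placeholder, the prefix is longer than the token list, and the observation vectors sit wherever the surrounding text puts them.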

Embodying the output: PaLM-E in a robot control loop. PaLM-E is a generative model producing text based on multi-modal sentences as input. In order to connect the output of the model to an embodiment, we distinguish two cases. If the task can be accomplished by outputting text only as, e.g., in embodied question answering or scene description tasks, then the output of the model is directly considered to be the solution for the task.

Alternatively, if PaLM-E is used to solve an embodied planning or control task, it generates text that conditions low-level commands. In particular, we assume access to policies that can perform low-level skills from some (small) vocabulary, and a successful plan from PaLM-E must consist of a sequence of such skills. Note that PaLM-E must determine on its own which skills are available based on the training data and the prompt, and no other mechanism is used to constrain or filter its outputs. Although these policies are language conditioned, they are not capable of solving long-horizon tasks or taking in complex instructions. PaLM-E is hence integrated into a control loop, where its predicted decisions are executed through the low-level policies by a robot, leading to new observations based on which PaLM-E is able to replan if necessary. In this sense, PaLM-E can be understood as a high-level policy that sequences and controls the low-level policies.
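A minimal sketch of such a control loop is given below. The planner `plan_next_skill`, the policy `low_level_policy`, the stacking task, and the simulated one-time failure are all hypothetical stand-ins; the point is only that the high-level model is queried again after every executed skill, so a failed skill is simply planned again from the new observation.

```python
def plan_next_skill(goal, observation):
    """Hypothetical stand-in for the high-level model: returns the next skill as text."""
    remaining = [block for block in goal if block not in observation["stacked"]]
    return f"stack {remaining[0]}" if remaining else "terminate"


def low_level_policy(skill, observation, failed_once):
    """Hypothetical low-level policy; the first attempt on the blue block fails."""
    block = skill.split()[1]
    if block == "blue" and block not in failed_once:
        failed_once.add(block)  # simulated failure: the scene does not change
        return observation
    return {"stacked": observation["stacked"] + [block]}


def run_episode(goal, max_steps=10):
    """Alternate between planning one skill and executing it until 'terminate'."""
    observation = {"stacked": []}
    failed_once = set()
    log = []
    for _ in range(max_steps):
        skill = plan_next_skill(goal, observation)
        log.append(skill)
        if skill == "terminate":
            break
        observation = low_level_policy(skill, observation, failed_once)
    return log, observation


log, final_observation = run_episode(["red", "blue"])
```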

Input & Scene Representations for Different Sensor Modalities

In this section, we describe the individual modalities that we incorporate into PaLM-E, and how we set up their encoders. We propose different architectural choices for each encoder $\phi:\mathcal{O}\rightarrow\mathcal{X}$ to map the corresponding modality into the language embedding space. We investigate state estimation vectors, Vision Transformers (ViTs) Dosovitskiy et al. (2020); Chen et al. (2022); Ryoo et al. (2021) for 2D image features, and the 3D-aware Object Scene Representation Transformer (OSRT) Sajjadi et al. (2022a). In addition to encoders that represent the input scene globally, we consider object-centric representations that factor observations into tokens that represent individual objects in the scene.

Object-centric representations. Unlike language, visual input is not pre-structured into meaningful entities and relationships: while ViT may capture semantics, the structure of the representation resembles a static grid rather than a collection of object instances. This poses a challenge both for interfacing with LLMs which have been pre-trained on symbols, and for solving embodied reasoning which requires interaction with physical objects. We therefore also explore structured encoders that aim to separate visual inputs into distinct objects before injecting them into the LLM. Given ground-truth object instance masks $M_{j}$, we can decompose ViT’s representation into $x_{1:m}^{j}=\phi_{\text{ViT}}(M_{j}\circ I)$ for object $j$.
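The masked decomposition can be sketched on a toy image as follows. The 2x2 image, the binary masks, and `toy_vit` (which emits one number per image row) are hypothetical; a real ViT would produce m embedding vectors per masked image.

```python
def apply_mask(mask, image):
    """Elementwise product of a binary instance mask M_j and the image I."""
    return [[m * p for m, p in zip(mask_row, image_row)]
            for mask_row, image_row in zip(mask, image)]


def toy_vit(image):
    """Hypothetical stand-in for phi_ViT: one scalar 'token' per image row."""
    return [float(sum(row)) for row in image]


image = [[1, 2], [3, 4]]
masks = {1: [[1, 0], [0, 0]],   # object 1 covers the top-left pixel
         2: [[0, 0], [1, 1]]}   # object 2 covers the bottom row
object_tokens = {j: toy_vit(apply_mask(mask, image)) for j, mask in masks.items()}
```

Each object gets its own token sequence, computed from an image in which every other pixel has been zeroed out.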

Entity referrals. For embodied planning tasks, PaLM-E must be able to reference objects in its generated plan. In many cases, including the majority of our experiments, objects in a scene can be identified in natural language by some of their unique properties. However, there also exist settings where objects are not easily identifiable by language in few words, e.g. if there are multiple blocks on a table of the same color at different locations. For object-centric representations such as OSRT, we label the multi-modal tokens corresponding to an object in the input prompt as follows: Object 1 is <obj_1>. $\ldots$ Object $j$ is <obj_j>. This enables PaLM-E to reference objects via special tokens of the form obj_j in its generated output sentences. In this case, we assume that the low-level policies operate on these tokens as well.
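As an illustration of this labeling scheme, the sketch below builds the "Object j is ..." part of a prompt and extracts object references from a generated plan. The placeholder spelling `<obj_j>`, the example plan sentence, and the helper names are assumptions made for this example.

```python
import re


def entity_prompt(num_objects):
    """Label each object's multi-modal tokens with an index the model can refer to."""
    return " ".join(f"Object {j} is <obj_{j}>." for j in range(1, num_objects + 1))


def referenced_objects(plan_text):
    """Return the object indices mentioned via obj_j tokens, in order of appearance."""
    return [int(j) for j in re.findall(r"obj_(\d+)", plan_text)]


prompt = entity_prompt(3)
refs = referenced_objects("Push obj_2 next to obj_1.")  # hypothetical plan text
```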

Training Recipes

PaLM-E is trained on a dataset of the form $D=\left\{\left(I_{1:u_{i}}^{i},w_{1:L_{i}}^{i},n_{i}\right)\right\}_{i=1}^{N}$, where each example $i$ consists of $u_{i}$-many continuous observations $I_{j}^{i}$, a text $w_{1:L_{i}}^{i}$, and an index $n_{i}$. Despite being a decoder-only model, the text consists of a prefix part up to index $n_{i}$ that is formed from multi-modal sentences, and the prediction target, which only contains text tokens. The loss function is therefore a cross-entropy loss averaged over the individual non-prefix tokens $w_{n_{i}+1:L_{i}}^{i}$. To form the multi-modal sentences within the model, we have special tokens in the text that get replaced by the embedding vectors of the encoders at the locations in the text of those tokens. We base PaLM-E on the pre-trained 8B, 62B, and 540B parameter variants of PaLM as the decoder-only LLM into which we inject the continuous observations through the input encoders. Those encoders are either pre-trained or trained from scratch, see Sec. 4. We refer to an 8B LLM combined with a 4B ViT as PaLM-E-12B, similarly a 62B LLM + 22B ViT as PaLM-E-84B, and 540B LLM + 22B ViT as PaLM-E-562B.
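The loss over non-prefix tokens can be sketched for a single example as below. The function name and the input format (one probability per position, namely the probability the model assigns to the true token there) are assumptions chosen to keep the example self-contained.

```python
import math


def prefix_masked_cross_entropy(true_token_probs, n):
    """Average negative log-likelihood over positions n+1..L; the prefix is skipped.

    true_token_probs[l] is the model probability of the true token at position l+1,
    and n is the prefix length of this example.
    """
    if not 0 <= n < len(true_token_probs):
        raise ValueError("the prefix must leave at least one target token")
    targets = true_token_probs[n:]
    return -sum(math.log(p) for p in targets) / len(targets)


# Four tokens, the first two form the prefix and do not contribute to the loss.
loss = prefix_masked_cross_entropy([0.9, 0.1, 0.5, 0.25], n=2)
```

In this example only the probabilities 0.5 and 0.25 enter the loss, so a poor prediction inside the prefix (0.1) has no effect.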

Co-training across tasks. In our experiments, we investigate the effects of co-training our models on a variety of diverse data. The “full mixture”, see App. A, consists primarily of a diverse set of internet-scale vision-and-language data, from a variety of tasks. The sampling frequencies are set such that only 8.9% of the full mixture is embodied data, and there are several tasks for each embodiment.
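A sampling-frequency mixture of this kind can be sketched as weighted sampling over data sources. Only the 8.9% embodied share is taken from the text; the two-way split, the source names, and the use of `random.choices` are illustrative assumptions and do not reflect the actual mixture in App. A.

```python
import random

# Only the 0.089 embodied share comes from the text; the rest is lumped together.
MIXTURE = {"embodied": 0.089, "non_embodied": 0.911}


def sample_sources(num_examples, seed=0):
    """Draw the data source for each training example according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=num_examples)


draws = sample_sources(10_000)
embodied_fraction = draws.count("embodied") / len(draws)
```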

Experiments

Our experiments consider diverse robotic (mobile) manipulation tasks across three different robot embodiments, in simulation and with two different real robots. We refer to https://palm-e.github.io for videos showing the capabilities of PaLM-E on those tasks. Although not the focus of our work, we evaluate PaLM-E also on general vision-language tasks such as visual-question-answering (VQA), image captioning, and established language modeling tasks.

We split our experimental investigation into two broad categories. First, we compare the different input representations from Sec. 4 with respect to performance, generalization, and data-efficiency. The second thread of experiments focuses on one architecture, the main PaLM-E version, consisting of a pre-trained ViT and PaLM language model that takes in raw images as the continuous inputs. Here we show that a single model, trained on a mixture of many datasets, across diverse tasks, and across robot embodiments, can simultaneously achieve high performance on all of those tasks. Crucially, we investigate whether co-training on these datasets enables transfer (Fig. 3): despite different tasks and embodiments, the performance on the individual tasks increases by training on the mixture of tasks. We study the influence on performance, generalization, and data efficiency with respect to co-training strategies and model parameter size. Finally, we consider if freezing the LLM and just training the ViT that injects vision into the LLM is a viable path.

As baselines, we consider the state-of-the-art visual language model PaLI Chen et al. (2022), which has not been trained on embodied robot data, as well as the SayCan algorithm Ahn et al. (2022), supplied with oracle affordances.

Our three robot environments (Fig. 1) include a Task and Motion Planning (TAMP) domain where a robot has to manipulate (grasp and stack) objects, a table-top pushing environment, and a mobile manipulation domain. In each domain, PaLM-E is trained on expert data from that domain. In many cases, this is a sparse amount of data per task. The TAMP tasks involve large combinatorics over possible plans, and many decision sequences are infeasible. PaLM-E has to generate plans that consist of multiple steps, with complicated decision boundaries. The multi-object tabletop pushing environment is taken from the publicly available Language-Table dataset Lynch et al. (2022) and is challenging since it includes several objects, large cardinality of language, and complex pushing dynamics. For both the TAMP and Language-Table environments, PaLM-E has to reason about the poses of the objects. It is not sufficient to know which objects are on the table or to know their rough relationships; the more fine-grained details about the scene geometry are important for solving the tasks. Finally, we consider a mobile manipulation domain similar to SayCan Ahn et al. (2022), where a robot has to solve a variety of tasks in a kitchen environment, including finding objects in drawers, picking them, and bringing them to a human. For all domains we consider both planning and VQA tasks in those environments. For the mobile manipulation and Language-Table environments, PaLM-E is integrated into the control loop to execute the plans in the real world, and has to adjust the plan in the presence of external disturbances or failures of the low-level control policies.

TAMP Environment

Tab. 7 (appendix) shows planning success rates and VQA performance for the TAMP environment. The LLM is frozen in these experiments (for pre-trained LLM). For the results reported in Tab. 7, the input representations are trained on a dataset containing 96,000 training scenes of solely the TAMP environment, i.e. no other data is part of the mixture. For 3-5 objects in the scene, which is the same number as in the training set, most input representations perform similarly well. However, when increasing the number of objects, it turns out that using a pre-trained LLM improves performance considerably, especially with entity referrals. Furthermore, we show that a 62B LLM shows better out-of-distribution generalization compared to the 8B variant, while a non-pretrained LLM shows basically no out-of-distribution generalization. The SayCan baseline Ahn et al. (2022) utilizes oracle affordance functions and has difficulties solving this environment, since affordance functions only constrain what is possible right now, but are not informative enough for the LLM to construct long-horizon plans in TAMP environments.

Tab. 1 shows results for 3-5 objects when training on 1% of the dataset, which corresponds to only 320 examples for each of the two planning tasks. Here we see that there are significant differences between the input representations, especially for the planning tasks. First, pre-training the LLM is beneficial in the low data regime for state inputs. Second, both ViT variants (ViT+TL, ViT-4B) do not perform well in solving the planning tasks with this little data. However, if we co-train on all other robot environments as well as general vision-language datasets (ViT-4B generalist), then the performance of the ViT-4B more than doubles. This shows a significant transfer effect between different robot embodiments and tasks. Finally, using OSRT as the input representation leads to the best performance here, demonstrating the strengths of 3D-aware object representations. We also observe another instance of transfer here: when we remove the TAMP VQA data and only train on the 640 planning task examples, there is a (slight) drop in performance. The state-of-the-art vision-language model PaLI Chen et al. (2022) that was not trained on robot data is not able to solve the tasks. We only evaluated it on $\text{q}_{2}$ (objects left/right/center on the table) and $\text{q}_{3}$ (vertical object relations), since those most resemble typical VQA tasks.

Language-Table Environment

Tab. 3 reports success rates on long-horizon tasks from the Language-Table environment Lynch et al. (2022). PaLM-E is integrated into a control loop that takes as input the long-horizon task and the current image, and outputs an instruction for the low-level policy. We see that joint training on internet-scale vision and language results in a more effective model for robot planning, particularly in the few-shot regime with only 10 demos per task. Scaling the 12B model to the 84B model leads to improvements on 2 of 3 tasks. As with the TAMP environment, neither SayCan nor zero-shot PaLI are effective, unable to solve the easiest task tested.

Real Robot Results and Few-Shot Generalization. In Fig. 7, a), we see PaLM-E is capable of guiding a real robot through a multi-stage tabletop manipulation task, while remaining robust to adversarial disturbances. Given the observed image and a long-horizon goal, e.g. “sort the blocks by colors into corners”, PaLM-E outputs language subgoals at 1 Hz to the policies from Lynch et al. (2022), which output low-level robot actions at 5 Hz. Prior work Lynch et al. (2022) instead involved a human in the loop to interactively guide subgoals and corrections. In Fig. 7, b) we see PaLM-E is capable of one-shot and zero-shot learning. Here, we finetuned PaLM-E on 100 different long horizon tasks with a single training example each, e.g. “put all the blocks in the center”, “remove the blue blocks from the line”. We additionally see that PaLM-E can generalize zero-shot to tasks involving novel object pairs (Fig. 7, c) and to tasks involving objects that were unseen in either the original robot dataset or the finetuning datasets, e.g. a toy turtle (Fig. 7, d).

Mobile Manipulation Environment

We demonstrate the performance of PaLM-E on challenging and diverse mobile manipulation tasks. We largely follow the setup in Ahn et al. (2022), where the robot needs to plan a sequence of navigation and manipulation actions based on an instruction by a human. For example, given the instruction “I spilled my drink, can you bring me something to clean it up?”, the robot needs to plan a sequence containing “1. Find a sponge, 2. Pick up the sponge, 3. Bring it to the user, 4. Put down the sponge.” Inspired by these tasks, we develop 3 use cases to test the embodied reasoning abilities of PaLM-E: affordance prediction, failure detection, and long-horizon planning. The low-level policies are from RT-1 Brohan et al. (2022), a transformer model that takes an RGB image and a natural language instruction, and outputs end-effector control commands.

Affordance prediction. We investigate PaLM-E’s performance at affordance prediction, i.e. whether a skill of the low-level policy can be executed in the current environment. This can be formulated as the VQA problem “Given <img>. Q: Is it possible to <skill> here?”. PaLM-E outperforms PaLI (zero-shot), as well as thresholding on value functions trained with QT-OPT (Tab. 4).

Failure detection. For a robot to do closed-loop planning, it is also important to detect failures, as is shown in Huang et al. (2022c). The multi-modal prompt is “Given <img>. Q: Was <skill> successful?”. Tab. 4 shows that PaLM-E outperforms PaLI (zero-shot), as well as a fine-tuned version of CLIP on this dataset. PaLM-E also outperforms the algorithm proposed in Xiao et al. (2022) that leverages two CLIP models trained with hindsight relabeled data. This method has access to more information than our method, and was specifically designed to just solve failure detection on this dataset.
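Both the affordance and the failure-detection queries are plain VQA prompts around an image placeholder and a skill description, which can be sketched as string templates. The placeholder spelling `<img>`, the helper names, and the example skill are assumptions made for illustration.

```python
def affordance_prompt(skill, image_token="<img>"):
    """Ask whether a low-level skill can be executed in the current scene."""
    return f"Given {image_token}. Q: Is it possible to {skill} here?"


def failure_prompt(skill, image_token="<img>"):
    """Ask whether a previously executed skill succeeded."""
    return f"Given {image_token}. Q: Was {skill} successful?"


example_affordance = affordance_prompt("pick up the sponge")
example_failure = failure_prompt("pick up the sponge")
```

At inference time the image token would be replaced by the encoder's embedding vectors, and the model's textual answer would be read off as the prediction.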

Real robot results: Long-horizon planning. Finally, we use PaLM-E to perform embodied planning end-to-end for mobile manipulation tasks. The prompt structure for this task is “Human: <instruction> Robot: <step history>. I see <img>.” PaLM-E is trained to generate the next step of the plan, conditioned on the history of taken steps and the current image observation of the scene. After each step is decoded, we map it to a low-level policy as defined in Ahn et al. (2022). This process is done in an autoregressive manner, until PaLM-E outputs “terminate”. We train the model by using the runs from Ahn et al. (2022), which contain 2912 sequences. We qualitatively evaluated the model in a real kitchen and found the model can carry out long-horizon mobile manipulation tasks, even under adversarial disturbances (Fig. 5).
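The step-by-step decoding described here can be sketched as a loop that rebuilds the prompt from the step history until the model emits "terminate". The exact prompt formatting, the placeholder `<img>`, and `scripted_model` (a lookup that replays the sponge example instead of calling a real model) are assumptions.

```python
SCRIPT = ["Find a sponge", "Pick up the sponge", "Bring it to the user",
          "Put down the sponge", "terminate"]


def planning_prompt(instruction, steps, image_token="<img>"):
    """Combine the instruction, the numbered step history, and the current image."""
    history = ", ".join(f"{k}. {step}" for k, step in enumerate(steps, start=1))
    return f"Human: {instruction} Robot: {history}. I see {image_token}."


def scripted_model(prompt):
    """Hypothetical stand-in for the model: replays SCRIPT based on the history."""
    done = sum(1 for step in SCRIPT[:-1] if step in prompt)
    return SCRIPT[done]


def decode_plan(instruction, next_step, max_steps=10):
    """Generate one step at a time, feeding the growing history back into the prompt."""
    steps = []
    for _ in range(max_steps):
        step = next_step(planning_prompt(instruction, steps))
        if step == "terminate":
            break
        steps.append(step)  # in the robot loop, the step would be executed here
    return steps


plan = decode_plan("I spilled my drink, can you bring me something to clean it up?",
                   scripted_model)
```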

Performance on General Visual-Language Tasks

Although it is not the focus of our work, we report in Tab. 5 results on general vision-language tasks, including OK-VQA Marino et al. (2019), VQA v2 Goyal et al. (2017) and COCO captioning Chen et al. (2015). A single, generalist PaLM-E-562B model achieves the highest reported number on OK-VQA, including outperforming models finetuned specifically on OK-VQA. Compared to Tsimpoukelli et al. (2021), PaLM-E achieves the highest performance on VQA v2 with a frozen LLM to the best of our knowledge. This establishes that PaLM-E is a competitive visual-language generalist, in addition to being an embodied reasoner on robotic tasks.

Performance on General Language Tasks

Tab. 8 reports the averaged performance of PaLM-E on 21 general language benchmarks for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks. The notable trend is that with increasing model scale, there is considerably less catastrophic forgetting of language capabilities. As seen in Fig. 6, while the smallest model (PaLM-E-12B) loses 87.3% of its NLG performance (relative) during multimodal training, the largest model (PaLM-E-562B) loses merely 3.9%.
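One natural reading of the relative degradation quoted here, stated as an assumption since this section does not spell out the definition, is the drop in average score relative to the corresponding language model before multimodal training, with $s_{\text{PaLM}}$ and $s_{\text{PaLM-E}}$ denoting the average NLG scores before and after:

$$\text{relative degradation}=\frac{s_{\text{PaLM}}-s_{\text{PaLM-E}}}{s_{\text{PaLM}}}.$$

Under this reading, a value of 87.3% means the model keeps only 12.7% of its original average score, while 3.9% means it keeps 96.1%.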

Summary of Experiments & Discussion

Generalist vs specialist models – transfer. As summarized in Fig. 3, we have shown several instances of transfer in this work, meaning that PaLM-E trained on different tasks and datasets at the same time leads to significantly increased performance relative to models trained separately on the different tasks alone. In Fig. 4, co-training on the “full mixture” achieves more than double the performance. In Tab. 9, we see significant improvements in performance if we add LLM/ViT pre-training, and training on the full mixture instead of the mobile manipulation data alone. For the Language-Table experiment in Tab. 3, we observe analogous behaviour.

Data efficiency. Compared to available massive language or vision-language datasets, robotics data is significantly less abundant. As discussed in the last paragraph, our model exhibits transfer, which aids PaLM-E to solve robotics tasks from very few training examples in the robotics domain, e.g. between 10 and 80 for Language Table or 320 for TAMP. The OSRT results show another instance of data-efficiency by using a geometric input representation. A promising opportunity for future work is to combine this with a method benefitting from large-scale visual data.

Retaining language capabilities. We have shown two paths to retain the language capabilities of the model during multimodal training. As one option, freezing the LLM and only training the input encoders is a viable path for building embodied language models, although this approach occasionally struggled on robotics tasks (Tab. 3). As an alternative route, when the whole model is trained end-to-end, the model retains significantly more of its original language performance with increasing model scale (Fig. 6).
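The difference between the two paths can be sketched with a toy parameter update. This is only an illustrative sketch under strong simplifying assumptions: each parameter group is reduced to a single scalar, the gradients are made up, and the names `sgd_step`, `input_encoder` and `llm` are hypothetical rather than part of the PaLM-E training code.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update to the trainable groups only; frozen groups are copied unchanged."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }


# Toy scalar "parameters" and gradients, for illustration only.
params = {"input_encoder": 1.0, "llm": 2.0}
grads = {"input_encoder": 0.5, "llm": 0.5}

# Path 1: frozen LLM, only the input encoder is updated.
frozen_llm = sgd_step(params, grads, trainable={"input_encoder"})

# Path 2: end-to-end training, both groups are updated.
end_to_end = sgd_step(params, grads, trainable={"input_encoder", "llm"})

print(frozen_llm["llm"], end_to_end["llm"])
```

In the frozen setting the language model weights cannot drift, so language capabilities are retained by construction. In the end-to-end setting, retention has to come from other factors such as model scale.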

Conclusion

We proposed to build an embodied language model by injecting multi-modal information such as images into the embedding space of a pre-trained LLM. Experiments showed that off-the-shelf state-of-the-art vision-language models trained on general VQA and captioning tasks are not sufficient for embodied reasoning tasks, and they also revealed limitations of a recent proposal for grounding language models through affordances. To overcome these limitations, we proposed PaLM-E, a single model that is able to control different robots in simulation and in the real world, while at the same time being quantitatively competent at general VQA and captioning tasks. In particular, the novel architectural idea of ingesting neural scene representations (i.e., OSRT) into the model is highly effective, even without large-scale data. PaLM-E is trained on a mixture of diverse tasks across multiple robot embodiments as well as general vision-language tasks. Importantly, we have demonstrated that this diverse training leads to several avenues of transfer from the vision-language domains into embodied decision making, enabling robot planning tasks to be achieved data efficiently. While our results indicate that frozen language models are a viable path towards general-purpose embodied multimodal models that fully retain their language capabilities, we have also surfaced an alternative route with unfrozen models: scaling up the language model size leads to significantly less catastrophic forgetting while becoming an embodied agent. Our largest model, PaLM-E-562B, showcases emergent capabilities like multimodal chain of thought reasoning, and the ability to reason over multiple images, despite being trained on only single-image prompts.

Acknowledgements

The authors would like to thank, for their advice, help and support: Xi Chen, Etienne Pot, Sebastian Goodman, Maria Attarian, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, Henning Meyer, and Kendra Byrne.

References

Appendix A Data Mixture

Tab. 6 shows the dataset and sampling frequency for the “full mixture” as referred to in the experiments. The majority of the data distribution is general vision-language tasks, with less than 10% robot data.
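A mixture defined by per-dataset sampling frequencies can be sketched as weighted sampling over data sources. The dataset names and weights below are placeholders chosen so that robot data stays below 10%; they are assumptions for illustration and are not the entries of Tab. 6.

```python
import random
from collections import Counter

# Hypothetical mixture: names and weights are placeholders, not the values in Tab. 6.
MIXTURE = {
    "vision_language_a": 60.0,
    "vision_language_b": 32.0,
    "robot_data": 8.0,
}


def sampling_probabilities(mixture):
    """Normalize raw sampling frequencies so that they sum to one."""
    total = sum(mixture.values())
    return {name: weight / total for name, weight in mixture.items()}


def sample_sources(mixture, num_examples, seed=0):
    """Draw the source dataset of each training example in proportion to its weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=num_examples)


probs = sampling_probabilities(MIXTURE)
counts = Counter(sample_sources(MIXTURE, 10_000))
print(probs["robot_data"], counts["robot_data"])
```

With such a scheme, the share of robot examples seen during training is controlled by the weights rather than by the raw dataset sizes.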

Appendix B Environment Details

The training scenes for the TAMP environment contain 3-5 cube-shaped objects of different sizes, colors and sampled initial poses. Fig. 8 shows an example test scene that contains 6 objects.

In the global version, we consider the following three VQA tasks, followed by two planning tasks:

$\text{q}_{2}$: object-table relation. Example prompt: Given . Q: Is the red object left, right, or center of the table?. Target: A: The red object is in the center of the table.

$\text{q}_{3}$: object-object relations. Example prompt: Given . Q: Is the yellow object below the blue object?. Target: A: No, the yellow object is not below the blue object.

$\text{q}_{4}$: plan feasibility. Example prompt: Given . Q: Is it possible to first grasp the blue object, then place it on the yellow object, and then grasp the yellow object?. Target: A: No, this is not possible.

$\text{p}_{1}$: grasping. Example prompt: Given . Q: How to grasp the green object?. Target: A: First grasp the orange object and place it on the table, then grasp the green object.

$\text{p}_{2}$: stacking. Example prompt: Given . Q: How to stack the white object on top of the red object?. Target: A: First grasp the green object and place it on the table, then grasp the white object and place it on the red object.
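The prompt and target pairs listed above follow one shared template, which can be sketched as a small helper. The placeholder string `<img>` for the position of the image embeddings is an assumption, since the exact token is not recoverable from the text above; the function name `build_example` is likewise hypothetical.

```python
from typing import Tuple

# Assumed placeholder marking where the image embeddings are injected.
IMAGE_SLOT = "<img>"


def build_example(question: str, answer: str, image_slot: str = IMAGE_SLOT) -> Tuple[str, str]:
    """Return a (prompt, target) pair in the 'Given ... Q: ... / A: ...' format."""
    prompt = f"Given {image_slot}. Q: {question}"
    target = f"A: {answer}"
    return prompt, target


prompt, target = build_example(
    "Is the yellow object below the blue object?",
    "No, the yellow object is not below the blue object.",
)
print(prompt)
print(target)
```

The same helper covers the VQA and the planning tasks, because they differ only in the question and answer strings.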

For the object-centric version with entity referrals, all prompts contain the prefix = Obj 1 is . ... Obj j is ., and the VQA task $\text{q}_{1}$ is about the color of an object. The other tasks remain the same, except for the different prefix and the entity referrals.

We utilize the planner from Driess et al. (2020) to generate the dataset for the planning tasks. The low-level policies are also obtained with the method of Driess et al. (2020).

B.2 Interactive Language Table

We use the Language-Table real-world tabletop setup and simulated environment from Interactive Language Lynch et al. (2022).

Data collection. For each task, given the long horizon instruction, we prompt a labeler to enter a short horizon command every 4 seconds. We pass the short horizon instructions to an Interactive Language policy trained using the same procedure as in Lynch et al. (2022). The policy executes 40 steps (10Hz for 4 seconds) before requiring another command from the labeler. This is repeated until the labeler determines the long horizon instruction is complete and issues a ’done’ instruction. The data collection procedure for the real world experiments is the same as in simulation.
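The collection procedure described above can be sketched as a simple loop. This is a minimal sketch, not the actual collection tooling: `collect_episode`, the `labeler` and `policy_step` callables, and the `max_commands` safety cap are hypothetical stand-ins, while the 10 Hz rate, the 4 second window and the 'done' signal come from the text.

```python
CONTROL_HZ = 10        # policy control rate stated in the text
COMMAND_PERIOD_S = 4   # seconds between labeler commands
STEPS_PER_COMMAND = CONTROL_HZ * COMMAND_PERIOD_S  # 40 policy steps per command


def collect_episode(long_horizon_instruction, labeler, policy_step, max_commands=50):
    """Record short horizon commands until the labeler issues 'done'.

    max_commands is a hypothetical safety cap, not part of the described procedure.
    """
    episode = []
    for _ in range(max_commands):
        command = labeler(long_horizon_instruction, episode)
        if command == "done":
            break
        for _ in range(STEPS_PER_COMMAND):
            policy_step(command)  # one control step of the low-level policy
        episode.append(command)
    return episode


# Scripted stand-ins for the human labeler and the learned policy.
scripted = iter(["push the red block left", "push the blue block up", "done"])
step_log = []
episode = collect_episode(
    "sort the blocks",
    labeler=lambda instruction, history: next(scripted),
    policy_step=step_log.append,
)
print(len(episode), len(step_log))
```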

Training and Evaluation. To obtain the finetuned versions of these models, we train a pretrained PaLM-E model for 9,000 additional steps, which supports a data complexity sweep without training several separate models from scratch on slightly different versions of the full mixture. For Tasks 2 and 3 in simulation, we implement an automated reward to measure the success rate, and we evaluate PaLM-E by running 80 rollouts for each task. Given the current image and high-level task, PaLM-E issues a text instruction, which a trained low-level policy executes for 4 seconds before PaLM-E issues a new text instruction. For Task 1, we use a test set and report validation accuracy. This is because the task requires only one step to solve, even though it is a complicated visual and linguistic processing task, and it cannot be solved by the low-level policy from the prompt alone.
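The rollout-based evaluation for Tasks 2 and 3 can be sketched as follows. The environment interface (`reset`, `execute`, `success`), the `ToyEnv` stand-in, the planner stub and the `max_instructions` cap are all assumptions made for illustration; only the 80 rollouts per task and the alternation between high-level instruction and low-level execution come from the text.

```python
NUM_ROLLOUTS = 80  # rollouts per task, as stated in the text


class ToyEnv:
    """Stand-in environment: the task succeeds after a fixed number of instructions."""

    def __init__(self, steps_needed):
        self.steps_needed = steps_needed
        self.progress = 0

    def reset(self):
        self.progress = 0
        return self.progress

    def execute(self, instruction):
        # Placeholder for the low-level policy executing one 4 second window.
        self.progress += 1
        return self.progress

    def success(self):
        # Placeholder for the automated reward.
        return self.progress >= self.steps_needed


def run_rollout(env, planner, task, max_instructions=20):
    """Alternate planner instructions and low-level execution until success or the cap."""
    observation = env.reset()
    for _ in range(max_instructions):
        instruction = planner(observation, task)
        observation = env.execute(instruction)
        if env.success():
            return True
    return False


def success_rate(make_env, planner, task, num_rollouts=NUM_ROLLOUTS):
    successes = sum(run_rollout(make_env(i), planner, task) for i in range(num_rollouts))
    return successes / num_rollouts


# Every fourth toy scene is made unsolvable within the instruction cap.
rate = success_rate(
    make_env=lambda i: ToyEnv(steps_needed=25 if i % 4 == 0 else 3),
    planner=lambda observation, task: "push the block",
    task="toy task",
)
print(rate)
```

In a real evaluation, the planner stub would be replaced by the model producing a text instruction from the current image and the high-level task.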

Appendix C Natural Language Generation and Understanding Results

Appendix D Additional Data for Affordance and Success Detection

Appendix E Image Attribution

The image of the New York Knicks and Boston Celtics in Figure 2 is under the terms CC-by-2.0 (https://creativecommons.org/licenses/by/2.0/), and was posted to Flickr by kowarski at https://www.flickr.com/photos/27728232@N00/8666371367. The egocentric video images are from https://youtu.be/-UXKmqBPk1w, as in Zeng et al. (2022), via permission from creator Cody Wanner.