Language Models as Zero-Shot Trajectory Generators
Teyun Kwon, Norman Di Palo, Edward Johns
Introduction
In recent years, Large Language Models (LLMs) have attracted significant attention and acclaim for their remarkable capabilities in reasoning about common, everyday tasks. This widespread recognition has since led to efforts in the robotics community to adopt LLMs for high-level task planning. However, for low-level control, existing proposals have relied on auxiliary components beyond the LLM, such as pre-trained skills, motion primitives, trajectory optimisers, and numerous language-based in-context examples (Fig. 2). Given the lack of exposure of LLMs to physical interaction data, it is often assumed that LLMs are incapable of low-level control.
However, this assumption has not yet been thoroughly examined. In this paper, we investigate whether LLMs have sufficient understanding of low-level control to be adopted for zero-shot dense trajectory generation for robot manipulators, without the need for the aforementioned auxiliary components. We provide an LLM (GPT-4) with access to off-the-shelf object detection and segmentation models, and then require that all remaining reasoning needed to predict a dense sequence of end-effector poses be performed by the LLM itself. We also require that the same task-agnostic prompt is used for all tasks, without any in-context examples.
Given these requirements, we studied whether a single prompt could be designed to solve a range of tasks taken from the recent literature, such as “open the bottle cap” and “wipe the plate with the sponge”. Through this investigation, we identified the principles and prompting strategies that enable LLMs to handle the complexities of robot manipulation.
Consequently, our contributions are threefold: (1) We demonstrate, for the first time, that a pre-trained LLM, when provided with only an off-the-shelf object detection and segmentation model, can guide a robot manipulator zero-shot by outputting a dense sequence of end-effector poses, without the need for pre-trained skills, motion primitives, trajectory optimisers, or in-context examples. (2) We present several ablation studies which shed light on which techniques and prompts lead to the emergence of these capabilities. (3) We study how, by analysing the trajectory of objects across an image, LLMs can also detect if a task has failed and subsequently re-plan an alternative trajectory.
Related Work
While previous works have made significant strides in leveraging LLMs for various aspects of robotic control, several limitations and dependencies on external modules persist. The core motivation of our work is to investigate whether these limitations are inherent, or if LLMs can be deployed for low-level control, going from language to a dense sequence of end-effector poses. In this section, we provide an overview of the relevant literature and highlight key distinctions between prior approaches and our research focus.
Predefined Motion Primitives: A subset of prior works, including Code as Policies and ChatGPT for Robotics, have predominantly employed LLMs to address the high-level planning aspect of robotic control. These approaches often rely on predefined movement primitives or pre-trained skills (such as SayCan) to execute lower-level actions, thereby only partially solving the control stack. In contrast, our investigation aims to push these boundaries by demonstrating that LLMs can delve deeper into the control stack, predicting all lower-level actions for the robot autonomously, in the form of a dense sequence of poses for the robot end-effector to follow to complete a given task.
External Trajectory Optimisers: VoxPoser and Language to Rewards have explored the use of LLMs to generate high-reward regions for robot movement, significantly contributing to trajectory planning. However, these methods still necessitate external trajectory optimisers to compute a trajectory, such as cost and reward functions used to evaluate randomly sampled trajectories along with Model Predictive Control (MPC). Our research deviates from this paradigm by showcasing that LLMs are capable of autonomously shaping and generating their own trajectories, either as lists of end-effector positions and orientations predicted as language tokens, or as the prediction of code snippets that can then generate these trajectories, both of which remove the reliance on external trajectory optimisers.
Use of In-Context Examples: Previous approaches such as VoxPoser, Code as Policies, and SayCan have relied heavily on providing in-context examples to the LLM input. However, these methods can encounter challenges when extrapolating beyond the demonstrated tasks. In contrast, our research illustrates that, even when relying on their internal understanding alone, LLMs exhibit the capacity to comprehend and solve a diverse range of manipulation tasks, thus broadening the scope of applicability and adaptability in the real world and reducing the reliance on human expertise.
Robotics-Specific Pre-Training and Fine-Tuning: Recently, Brohan et al. and Driess et al. demonstrated that a Vision Language Model (VLM) can be combined with a large robotics-related dataset of actions to enable zero-shot language-conditioned control. However, both the VLM weights and the compute capacity to fine-tune them are unavailable to most research groups. We therefore focus our investigation on widely available LLMs and vision models, and tackle many tasks from the recent literature that require similar or better dexterity than the ones included in the work by Brohan et al.
In summary, while prior research has made notable strides in harnessing LLMs for robotics, often focusing on specific components of the control stack or relying on external modules, our investigation represents a departure from these paradigms. We explore the potential of LLMs to provide end-to-end solutions, encompassing the entire control stack from language comprehension to the prediction of dense sequences of end-effector poses. Our investigation expands the known capabilities of LLM-guided robotics, offering a promising avenue for enhancing human-robot interaction and task execution.
Problem Formulation
We investigate if an LLM (GPT-4) can predict a dense sequence of end-effector poses to solve a range of manipulation tasks. We now explain what the assumptions and constraints are in our investigation, followed by details of our real-world experimental setup, and the tasks used for evaluation. Given this background, we then present our investigation and its results in Sec. 4.
Assumptions and Constraints: We design a task-agnostic prompt to study the zero-shot control capabilities of LLMs, with the following assumptions: (1) no pre-existing motion primitives, policies or trajectory optimisers: the LLM should output the full sequence of end-effector poses itself; (2) no in-context examples: we study the ability of LLMs to reason about tasks given their internal knowledge alone, and no part of any task is explicitly mentioned in the prompt, either in the form of examples or instructions; (3) the LLM can query a pre-trained vision model to obtain information about the scene, but should autonomously generate, parse and interpret the inputs and outputs; (4) no additional pre-training or fine-tuning on robotics-specific data: we focus our research on pre-trained and widely available models, so that our work can easily be reproduced even with limited compute budget.
Real-World Experimental Setup: We run our experiments on a Sawyer robot equipped with a 2F-85 Robotiq gripper. We use two Intel RealSense D435 RGB-D cameras, one mounted on the wrist of the robot, and the other fixed on a tripod, to observe the environment. The wrist-mounted camera captures a top-down view of the environment at the beginning of an episode (Fig. 3), which is used by a vision model if queried by the LLM. We utilise a pre-trained object detection model, LangSAM (based on Grounding DINO and Segment Anything), and whenever the LLM calls detect_object, we automatically calculate 3-D bounding boxes of the queried objects from the segmentation maps returned by LangSAM using the camera calibration, and provide the bounding boxes to the LLM. The LLM then leverages this visual understanding of the environment to predict a sequence of 4-D end-effector poses (3 dimensions for position, 1 dimension for rotation about the vertical axis), as well as either open_gripper or close_gripper commands. This sequence is then executed by the robot in an open loop, using a position controller to move sequentially between each pose, hence producing a full trajectory. During this execution, we use XMem to track the segmentation maps over the entire duration of the task, and these tracked maps are later used to detect whether the task was successful.
Task Selection: In pursuit of objectivity, we opt to benchmark our zero-shot LLM-guided robotic control against a challenging repertoire of everyday manipulation tasks. We recreated 26 everyday manipulation tasks from recent robotics papers published at leading conferences, which were often tackled by relying on hundreds of manual demonstrations. This serves as a representative benchmark of real-world challenges, mirroring the complexity and diversity of the tasks encountered in contemporary robotics research. We choose tasks which semantically cover the most representative tabletop robot behaviours expressed in these papers, and success criteria are human-evaluated and designed to mirror those proposed in the original papers. For each combination of task and method in the following experimental sections, we calculate the success rate over 5 randomised positions and orientations of the objects. The task description is provided in natural language to the LLM, after which no additional human feedback or intervention is allowed. The full list of tasks is shown in Fig. 6, and videos are available at https://www.robot-learning.uk/language-models-trajectory-generators.
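As an illustration of the bounding-box step described above, the following is a minimal sketch of how a 3-D box might be derived from a segmentation mask and an aligned depth image. The paper does not spell out this computation, so the pinhole camera model, the function name bounding_box_3d, and the intrinsics parameters (fx, fy, cx, cy) are assumptions made for this example; the result is expressed in the camera frame, and any transform into the robot frame is omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def bounding_box_3d(
    mask: Sequence[Sequence[int]],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[Point, Point]:
    """Return (min_corner, max_corner) of the masked pixels in the camera frame.

    Hypothetical helper: assumes a pinhole camera with focal lengths (fx, fy)
    and principal point (cx, cy), and a depth image in metres aligned to the mask.
    """
    points: List[Point] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if inside and z > 0:  # skip background and invalid depth readings
                # Back-project pixel (u, v) at depth z into 3-D.
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    if not points:
        raise ValueError("mask contains no pixels with valid depth")
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))
```

In a setup like the one described, a box of this kind (after transformation into the robot frame) would be the only geometric information passed back to the LLM.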
Prompt Development
Full Prompt: The core motivation of our work is to investigate whether LLMs can inherently guide robots with minimal dependence on specialised external models and components, in order to provide effective and useful insights for the robotics community. Through this investigation, we designed a single task-agnostic prompt for a range of everyday manipulation tasks, which does not require any in-context examples or task-specific guidance. Fig. 4 illustrates the main information flow in our framework, showing how the task-agnostic prompt interfaces with the vision models and the robot.
Following the experiments outlined in this section, our final prompt instructs the LLM to self-summarise and decompose the predicted plan into steps, before generating Python code which, when run by a standard Python interpreter, outputs a dense sequence of poses for the end-effector to follow; this pipeline resulted in the best performance across those we experimented with. We include details fundamental to all tasks, such as coordinate definitions, as well as functions available for the LLM to call, such as detect_object, which returns the calculated 3-D bounding boxes of the queried objects directly to the LLM. We also include instructions which aim to improve the correctness and reliability of the generated trajectories, such as guidance on step-by-step reasoning, code generation, and collision avoidance. The full prompt is shown in Appendix A.
Prompt Ablations: During the design of this full prompt, we identified several challenges when using LLMs for low-level control, without access to other external dependencies. In this section, we now outline these challenges which motivated the final design of the prompt, and accompany them with results from ablation studies conducted across a diverse set of tasks (Fig. 5), where certain parts of the full prompt were removed. We choose a subset of the 26 original tasks for the ablation studies, which we list in Appendix E, that still capture the various manipulation challenges in the full set. The ablated components of the full prompt are shown in Appendix B.
(1) LLMs often require step-by-step reasoning to solve tasks. Prior work has shown that the reasoning capabilities of LLMs can be improved by asking them to break down the task in a step-by-step manner, and adopting this strategy, we prompt the LLM (a) to break down the trajectory into a sequence of sub-trajectory steps, and (b) to include in the plan when to lower the gripper to make contact with an object. We find that, without including these step-by-step reasoning prompts, the LLM often omits key trajectory steps required to execute the task successfully, such as opening or closing the gripper, or aligning the gripper to be parallel to the graspable side of the object, which are not stated explicitly in the prompt. Indeed, the first three columns in Fig. 5 show that prompting the LLM to think step by step resulted in the highest performance increase.
(2) LLMs can be prone to writing code which results in errors, both syntactic and semantic. While much improvement has been made in the domain of code generation by LLMs, their outputs can still throw errors, as well as produce undesirable results when executed. In order to mitigate this, and inspired again by the power of LLMs performing an internal monologue with natural language reasoning, we prompt the LLM to document any functions it defines, with their expected inputs and outputs, and their data types. In addition, we include a prompt instructing the LLM to define reusable functions for common motions (for example, linear trajectory from one point to another), to prevent instances where, as a notable example, it would hard-code the height of the gripper inside a function definition, and reuse that function for another sub-trajectory step which should have been executed at a different height. Similarly, we prompt the LLM to name each sub-trajectory step variable with a number to relate it to each of the steps in the high-level trajectory plan, and to minimise the chance of omitting a sub-trajectory step. The effects of removing these prompt components are, again, noticeable (fourth and fifth columns in Fig. 5).
(3) LLMs are trained on limited grounded physical interaction data. Due to the scarcity of grounded physical interaction data in their training corpora, LLMs often fail to take into account possible collisions between the objects being manipulated. We therefore prompt the LLM to pay attention to the dimensions of the objects and to generate additional waypoints and sub-trajectories, which could help with avoiding collisions. We also include in the prompt a specific phrase which we noticed during our investigation was being used frequently by the LLM for its internal reasoning (“clear objects and the tabletop”). Our experiments show that, while removing this particular phrase from the collision avoidance prompt lowered performance (sixth column in Fig. 5), LLMs do possess some inherent understanding of possible collisions between different objects, as they performed well even after removing the entire collision avoidance prompt (tenth column in Fig. 5).
(4) LLMs often fail to reason about complex trajectory shapes. In a manner similar to the first challenge, we employ a two-step strategy: we first explicitly ask the LLM to generate a textual description of the shape of the motion trajectory as internal reasoning (for example, shaking involves a sinusoidal motion), before it outputs the actual sequence of poses required to execute this trajectory (in contrast to Challenge (1), where we prompted the LLM to output a more detailed step-by-step trajectory plan). This has been shown to be beneficial in prior work, and indeed this result is also reflected in the eighth column in Fig. 5.
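To make the code-generation guidance above concrete, here is a minimal sketch of the kind of documented, reusable motion function the prompt encourages, with heights passed as arguments rather than hard-coded. It is an illustrative example rather than actual LLM output from the experiments; the name linear_trajectory, the Pose alias, and the numerical values are hypothetical.

```python
from typing import List, Tuple

# Hypothetical 4-D pose: (x, y, z, rotation about the vertical axis).
Pose = Tuple[float, float, float, float]


def linear_trajectory(start: Pose, end: Pose, num_points: int = 50) -> List[Pose]:
    """Interpolate linearly between two end-effector poses.

    Inputs:  start, end  - 4-tuples of floats (x, y, z, rotation)
             num_points  - int, number of poses to generate (at least 2)
    Output:  list of num_points 4-tuples, from start to end inclusive
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    return [
        tuple(s + (e - s) * i / (num_points - 1) for s, e in zip(start, end))
        for i in range(num_points)
    ]


# Sub-trajectory variables are numbered to match the steps of the high-level plan.
# Step 1: lower the gripper to the object. Step 2: lift it to a different height.
trajectory_1 = linear_trajectory((0.4, 0.1, 0.30, 0.0), (0.4, 0.1, 0.05, 0.0))
trajectory_2 = linear_trajectory((0.4, 0.1, 0.05, 0.0), (0.4, 0.1, 0.20, 0.0))
```

Because the start and end heights are parameters, the same function serves both the approach and the lift, which avoids the hard-coded-height failure described above.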
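The shaking example can be sketched as follows: once the shape has been described as sinusoidal, the poses follow from a few lines of code. This is a hypothetical illustration, not output from the experiments; the function name shake_trajectory, the choice of oscillating along the x axis, and the parameter names are assumptions.

```python
import math
from typing import List, Tuple

Pose = Tuple[float, float, float, float]  # (x, y, z, rotation about vertical axis)


def shake_trajectory(
    centre: Pose, amplitude: float, cycles: int, num_points: int = 100
) -> List[Pose]:
    """Oscillate sinusoidally along x around a centre pose.

    Hypothetical helper: y, z and rotation are held fixed; x completes
    the requested number of full sine cycles over the trajectory.
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    x0, y0, z0, rotation = centre
    return [
        (
            x0 + amplitude * math.sin(2 * math.pi * cycles * i / (num_points - 1)),
            y0,
            z0,
            rotation,
        )
        for i in range(num_points)
    ]
```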
(5) LLMs often fail to reason about how to interact with objects. In our experiments, we found that LLMs often simplified and failed to reason about more intricate details of object interaction, such as realising that some objects require interaction with a specific part (for example, the rim of a bowl, or the handle of a pan). In order to enable the LLM to detect the most suitable object part to interact with, we prompt it to describe the object part in high-level natural language, and we see in the ninth column in Fig. 5 that this results in more tasks being executed successfully.
Full Prompt Evaluation: Here, we investigate the LLM’s ability to solve a range of manipulation tasks zero-shot, by evaluating the full prompt on the full set of tasks taken from the recent literature. These tasks and their success rates are presented in Fig. 6. Remarkably, our experiments reveal that LLMs, when equipped with an off-the-shelf vision model and no external motion primitives, policies, or trajectory optimisers, do indeed exhibit notable proficiency in executing the majority of these tasks, by directly predicting a dense sequence of end-effector poses. In the original papers from which these tasks are selected, solving these tasks required numerous human demonstrations. As such, these findings underscore the potential of LLMs as intuitive and versatile guides for robotic manipulation that minimise the need for human time and supervision.
Further Investigations
In this section, we detail further ablation studies conducted regarding the modality of the trajectory generation, the extent to which each output modality is executable by the robot, and the ability of LLMs to detect whether a task was executed successfully or not and subsequently re-plan the trajectory. We list the subset of the original tasks used for this set of ablation studies in Appendix E. The prompts for these ablation studies are shown in Appendix C.
(1) How should the final trajectory be represented? In this set of experiments, we explore the optimal way for the LLM to output the sequence of end-effector poses. Specifically, we conduct ablation studies to evaluate whether this should be represented as a list of numerical values or as code for trajectory generation. Fig. 7 shows the distinction between these two different output modes.
The results, summarised in Fig. 8 A, offer valuable insights. Notably, our investigation shows that outputting code that generates the trajectory outperforms predicting the trajectory directly as an explicit list of numerical poses for the end-effector to follow, represented as language tokens (Fig. 7). In particular, we observe that representing trajectories as numerical values or as code yields similar performance for most tasks, with distinctions emerging in cases involving more intricate trajectories (for example, drawing a circle or a five-pointed star), where outputting code that generates such trajectories prevails (60% success rates for code output compared to 10% for numerical output). This suggests a fundamental property of LLMs for control: while not trained on physical interactions and trajectories, their understanding of both code and mathematical / geometrical structures can bridge these two modes of thinking. Once the overall trajectory shape has been identified by the LLM, while it can be challenging to follow it directly in numbers, it is proficient at generating code that itself can follow complex paths.
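To illustrate why code output helps for intricate shapes, the sketch below generates a five-pointed star from a single geometric observation: visiting every second vertex of a regular pentagon traces a pentagram. It is an illustrative example under stated assumptions, not a trajectory from the experiments; the function names and the fixed drawing height z are hypothetical.

```python
import math
from typing import List, Tuple

Pose = Tuple[float, float, float, float]  # (x, y, z, rotation about vertical axis)


def star_vertices(cx: float, cy: float, radius: float) -> List[Tuple[float, float]]:
    """Return the six corner points of a closed pentagram (first point repeated).

    Stepping by 144 degrees (4*pi/5) visits every second vertex of a pentagon.
    """
    return [
        (
            cx + radius * math.cos(math.pi / 2 + k * 4 * math.pi / 5),
            cy + radius * math.sin(math.pi / 2 + k * 4 * math.pi / 5),
        )
        for k in range(6)
    ]


def star_trajectory(
    cx: float, cy: float, z: float, radius: float, points_per_edge: int = 20
) -> List[Pose]:
    """Densely interpolate along the five edges of the star at constant height z."""
    vertices = star_vertices(cx, cy, radius)
    poses: List[Pose] = []
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        for i in range(points_per_edge):
            t = i / points_per_edge
            poses.append((x0 + (x1 - x0) * t, y0 + (y1 - y0) * t, z, 0.0))
    poses.append((*vertices[-1], z, 0.0))  # close the shape at the starting corner
    return poses
```

Writing out the same path as an explicit list would require the model to emit every interpolated coordinate correctly as tokens, which is consistent with the lower success rate reported for numerical output on such shapes.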
Additionally, we study whether directly generating a list of numerical poses, or code that then generates the poses itself, leads to executable outputs more often. Giving low-level control to the LLM poses the risk of the robot receiving wrongly formatted outputs that cannot be executed by the robot. Therefore, in this ablation, we investigate how often the output of the LLM is formatted such that it is executable by the robot. We include prompts instructing the LLM to follow a specific format for the trajectory generation (for the former, we require a list between the ⟨trajectory⟩ and ⟨/trajectory⟩ tags without any Python functions, and for the latter, we require any Python code to be between the ```python and ``` tags). Given the output of the LLM, if an error is thrown during automatic parsing according to this format, we provide the LLM with the error message and ask it to correct the output, for up to three times. Measuring the percentage of executable outputs (Fig. 8 B) demonstrates that outputting code results in 100% of executable trajectories, while direct numerical values cannot be parsed even after three self-corrections for some episodes.
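The parse-and-self-correct loop for the numerical output mode can be sketched as below. This is a minimal illustration, assuming ASCII <trajectory> tags and a list-of-lists pose format; the function names and the ask_llm callable (standing in for a query to the LLM) are hypothetical, and the actual parser used in the experiments may differ.

```python
import ast
import re
from typing import Callable, List, Tuple

Pose = Tuple[float, float, float, float]


def parse_trajectory(text: str) -> List[Pose]:
    """Extract a list of 4-D poses from between <trajectory> tags."""
    match = re.search(r"<trajectory>(.*?)</trajectory>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <trajectory>...</trajectory> block found")
    poses = ast.literal_eval(match.group(1).strip())  # accepts literals only, no code
    if not isinstance(poses, list) or not all(
        isinstance(p, (list, tuple)) and len(p) == 4 for p in poses
    ):
        raise ValueError("expected a list of 4-D poses")
    return [tuple(float(v) for v in p) for p in poses]


def parse_with_retries(
    first_output: str, ask_llm: Callable[[str], str], max_corrections: int = 3
) -> List[Pose]:
    """Parse the output, feeding any error back to the LLM up to max_corrections times."""
    output = first_output
    for attempt in range(max_corrections + 1):
        try:
            return parse_trajectory(output)
        except (ValueError, SyntaxError, TypeError) as error:
            if attempt == max_corrections:
                raise  # still not executable after the allowed self-corrections
            output = ask_llm(f"Parsing failed with: {error}. Please correct the output.")
    raise AssertionError("unreachable")
```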
(2) How should the LLM output the gripper action? We also investigate the optimal way of letting the LLM control the gripper open or close action: we compare using a binary variable or explicit functions open_gripper, close_gripper. Our results, in Fig. 8 C, demonstrate that the LLM achieves better performance when using explicit functions, while using a binary variable leads to more errors. A notable failure case stemmed from the LLM hard-coding the gripper state to be open in one of the functions it defined for itself, such that when the same function was then used to generate the object approach-and-lift sub-trajectory steps, the gripper failed to close and grasp the object. Having explicit functions to open and close the gripper, on the other hand, allowed a decoupling of these fundamental actions and enabled the correct functions to be called at any time during the overall trajectory generation plan.
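The decoupling described above can be sketched as follows: motion helpers return poses only, and the gripper is changed solely through explicit calls. The names open_gripper and close_gripper come from the text; execute_trajectory, vertical_move, grasp_and_lift, and the command_log used in place of a real robot interface are hypothetical stand-ins for this example.

```python
from typing import List, Tuple

Pose = Tuple[float, float, float, float]  # (x, y, z, rotation about vertical axis)

# Stand-in for the robot interface: records commands instead of moving hardware.
command_log: List[tuple] = []


def open_gripper() -> None:
    command_log.append(("open",))


def close_gripper() -> None:
    command_log.append(("close",))


def execute_trajectory(poses: List[Pose]) -> None:
    command_log.append(("move", list(poses)))


def vertical_move(
    x: float, y: float, z_start: float, z_end: float, rotation: float, num_points: int = 10
) -> List[Pose]:
    """Return poses for a vertical move; carries no gripper state at all."""
    return [
        (x, y, z_start + (z_end - z_start) * i / (num_points - 1), rotation)
        for i in range(num_points)
    ]


def grasp_and_lift(x: float, y: float, z_hover: float, z_grasp: float, rotation: float = 0.0) -> None:
    """Approach, grasp and lift, reusing the same motion helper for both moves."""
    open_gripper()
    execute_trajectory(vertical_move(x, y, z_hover, z_grasp, rotation))
    close_gripper()
    execute_trajectory(vertical_move(x, y, z_grasp, z_hover, rotation))
```

Since vertical_move has no gripper argument, reusing it for the lift cannot accidentally re-open the gripper, which is the failure mode reported for the binary-variable variant.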
(3) Can LLMs recognise unsuccessful trajectories, and adapt their plan? Finally, we delve into the ability of LLMs to recognise and respond to failures during task execution, as shown in Fig. 9. Our experiments demonstrate that, by analysing the numerical trajectories of objects, LLMs can autonomously detect failure outcomes and initiate re-planning to rectify them. We therefore demonstrate that LLMs possess not only the ability to generate trajectories, but also to discern whether they represent successful or unsuccessful episodes, given the tasks requested by the user.
For each of the 5 trials of a task, when a failure is identified, the LLM modifies the original plan to tackle the possible issue. In Fig. 8 D, we demonstrate that this leads to a small improvement in performance, without the need for any human intervention. As a notable example, the LLM always fails at grasping a bowl on its first try (Fig. 6), attempting to grasp it by the centroid (Fig. 10). Through a sequence of two re-planning iterations, however, the LLM adapts its trajectory and then successfully grasps the bowl by its rim, leading to an increase in the overall task execution success rate. The prompts for success detection and re-planning are shown in Appendix D.
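The control flow of failure detection and re-planning can be sketched as a simple loop. In the paper both the success judgement and the re-planning are performed by the LLM from tracked object trajectories; here they are represented by hypothetical callables (plan, execute, task_succeeded, replan), and the limit of two re-plans is an assumption chosen to match the bowl example rather than a documented setting.

```python
from typing import Callable, List, Sequence, Tuple

Pose = Tuple[float, float, float, float]
Track = Sequence[Tuple[float, float, float]]  # tracked object positions over time


def run_with_replanning(
    plan: Callable[[], List[Pose]],
    execute: Callable[[List[Pose]], Track],
    task_succeeded: Callable[[Track], bool],
    replan: Callable[[List[Pose], Track], List[Pose]],
    max_replans: int = 2,
) -> Tuple[bool, int]:
    """Execute a plan, re-planning after each detected failure.

    Returns (success, number_of_replans_used). All four callables are
    hypothetical stand-ins for LLM queries and open-loop robot execution.
    """
    trajectory = plan()
    for attempt in range(max_replans + 1):
        object_track = execute(trajectory)  # open-loop run, returns the tracked object path
        if task_succeeded(object_track):
            return True, attempt
        if attempt < max_replans:
            trajectory = replan(trajectory, object_track)
    return False, max_replans
```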
Conclusions
Can a Large Language Model be employed for zero-shot control of a robot manipulator, going from a task description to a dense sequence of end-effector poses directly, without the use of in-context prompts, predefined skills, or external trajectory optimisers? Answering this question guided our work, and led us to design a large set of experiments to explore the abilities and design choices that enable LLMs to take full control of a manipulator. Our experiments encompassed 26 diverse tasks drawn from the recent literature to provide a comprehensive benchmark.
We have demonstrated that, when provided with the right prompt, LLMs can successfully predict dense sequences of end-effector poses for a range of real-world manipulation tasks. This is achieved under the constraints that the LLM must use a single task-agnostic prompt without any in-context examples, and has access to only off-the-shelf object detection and segmentation vision models, with no other auxiliary components. This raises the assumed limit of the utility of LLMs for robotics, and we hope that our investigations into how to write an LLM prompt for robots will act as a helpful guide for the community wishing to raise this limit further.
Additional details on the prompt and the code used in these experiments are available in the Appendix and on our website at https://www.robot-learning.uk/language-models-trajectory-generators.
Acknowledgements
The authors wish to thank Kamil Dreczkowski, Georgios Papagiannis and Pietro Vitiello for their valuable discussion and feedback during the writing of the paper.