Large Language Models for Robotics: A Survey

Fanlong Zeng, Wensheng Gan, Zezheng Huai, Lichao Sun, Hechang Chen, Yongheng Wang, Ning Liu, Philip S. Yu

Introduction

Humans possess exceptional proficiency in executing intricate and dexterous manipulation skills by integrating tactile, visual, and other sensory inputs. Research in the field of robotics aspires to imbue robots with comparable manipulation intelligence. Although recent advancements in robotics and machine learning have yielded promising results in visual imitation and exploration learning for robot manipulation, there remains much to be accomplished in this area. Large language models (LLMs), such as BERT, RoBERTa, GPT-3, and GPT-4, have emerged as significant research achievements in the field of artificial intelligence (AI) in recent years. Through deep learning techniques, LLMs can be trained on massive text corpora, enabling them to generate high-quality natural language text. This development has sparked new thinking in natural language processing and dialogue systems. At the same time, the rapid advancement of robotics technology has created a demand for more intelligent and natural human-machine interaction. Combining LLMs with robots can provide robots with stronger natural language understanding and generation capabilities, enabling more intelligent and human-like conversations and interactions.

Applying LLMs to the field of robotics has important research significance and practical value. Firstly, LLMs can significantly enhance a robot’s natural language understanding and generation capabilities. Traditional robot dialogue systems often require manually written rules and templates, making it difficult to handle complex natural language inputs. LLMs, on the other hand, can better understand and generate natural language by learning from massive text corpora, enabling robots to have more intelligent and natural conversation abilities. Secondly, LLMs can provide more diverse conversation content and personalized interaction experiences. Through interaction with LLMs, robots can generate varied responses and personalize interactions based on user preferences and needs. This helps improve user satisfaction and the quality of interactions. In addition, the combination of LLMs and robots contributes to the advancement of artificial intelligence and robotics technology, laying the foundation for future intelligent robots (also called smart robots).

Currently, many research teams and companies have begun exploring the application of LLMs in the field of robotics. Some research focuses on using LLMs for natural language understanding in robots. By using pre-trained language models, robots can better understand user intentions and needs. Other research focuses on using LLMs for natural language generation in robots. Robots can generate fluent and coherent natural language responses through interaction with language models. Furthermore, some research explores how to combine LLMs with other technologies, such as knowledge graphs and sentiment analysis, to further enhance robot dialogue capabilities and user experiences. From multiple perspectives, LLM-based robotics is one of the most promising paths to achieve embodied intelligence in the future.

Although the combination of LLMs and robots has many potential advantages, it also faces challenges and issues. Firstly, training and deploying LLMs require substantial computing resources and data, which can be challenging for resource-limited robot platforms. Secondly, LLMs may generate inaccurate, unreasonable, or even harmful content when generating natural language text. Effective filtering and control mechanisms are necessary to ensure that the content generated by robots complies with ethical and legal requirements. Additionally, robot dialogue systems need to address challenges such as multi-turn dialogues, context understanding, and dialogue consistency to provide more coherent and human-like interactions. Furthermore, the shape of robots has not been standardized across the industry. The question remains whether robots should adopt a humanoid form or take on a different shape. In other words, what form of robot is best suited for our needs? The impact of embodied intelligence on our society cannot be overstated. Will robots eventually replace human labor? How should we respond to this seismic shift in the future? Moreover, if robots were to gain consciousness, should we still view them as tools? How should humans define a conscious robot?

In conclusion, the applications of large language models in robotics hold tremendous potential. They provide new paradigms and methods for robot control, path planning, and intelligence. Through more intuitive and natural human-machine interaction, language-based path planning, and intelligent semantic understanding, large language models not only enhance the performance and efficiency of robots but also improve the experience and interaction modes of human-robot interaction.

Therefore, this comprehensive review aims to summarize the applications of LLMs in robotics, delving into their impact and contributions to key areas such as robot control, perception, decision-making, and path planning. To summarize, there are four key contributions in this paper, as follows:

We discussed the latest advancements in LLMs and their significant impact on the field of robotics. We highlighted the benefits of LLMs for robots, as well as the emergence of new robot models equipped with LLMs in recent years.

We discussed the current state of robot technology, focusing on advancements in perception, decision-making, control, and interaction combined with LLMs. Specifically, we highlighted the critical role of LLMs in decision-making modules, which have enabled robots to make more informed and effective decisions in various applications.

We explored potential applications of current robots equipped with LLMs in the near future.

We discussed several potential challenges that robots may face when integrated with LLMs, as well as the potential impact of future developments in this field on human society.

Organization: The rest of this article is organized as follows. In Section 2.1, we discuss related concepts of LLMs and robotics. In Section 2.3, we introduce the new robot models equipped with LLMs in recent years. In Section 3, we provide a practical guide to the key technical points. We introduce applications in Section 4. Moreover, we highlight the challenges in Section 5 and present several promising directions of LLMs for robotics in Section 6. Finally, we conclude this paper in Section 7.

Robotics Based on LLMs

Amidst the swift progress and extensive proliferation of LLMs, robotics models based on LLMs have emerged. As shown in Figure 1, the LLM serves as the brain of the robot, making it more intelligent. In this part, we first review the basic concepts of LLMs and the LLMs that are popular today. After that, we describe the benefits of combining robotics with LLMs. Finally, we introduce recent robotics models based on LLMs and the Transformer architectures designed for robotics. For convenience, we also summarize the abbreviations used in this paper in Table 1.

We first provide an overview of LLMs, starting with an introduction to some fundamental concepts. We then delve into the history of their development, followed by a brief discussion of their growing popularity in recent years. A language model is a computational model that utilizes statistical methods to analyze and predict the probability of word sequences in a given language. It is designed to capture the patterns, grammar, and semantic meaning of natural language.

N-gram models are a simple form of language model that calculates the probability of a word based on the preceding (n-1) words. They are widely used due to their simplicity and efficiency. The accuracy of an N-gram model depends on the length of the context used: larger values of n generally improve accuracy, provided that enough training data is available to estimate the longer contexts reliably.
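To make the N-gram idea concrete, the following minimal sketch estimates a bigram model (n = 2) by maximum likelihood on a toy corpus. The corpus, the function name `train_bigram`, and the sentence boundary markers `<s>` and `</s>` are illustrative assumptions, not part of any system surveyed here, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count bigrams and return a function giving P(word | previous word)."""
    context_counts = Counter()
    pair_counts = defaultdict(Counter)
    for sentence in corpus:
        # Hypothetical boundary markers so that sentence starts/ends are modeled.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[prev][word] += 1

    def prob(word, prev):
        # Maximum-likelihood estimate; unseen contexts or pairs get probability 0.
        if context_counts[prev] == 0:
            return 0.0
        return pair_counts[prev][word] / context_counts[prev]

    return prob


# Toy corpus (an assumption for illustration only).
corpus = [
    "the robot opens the door",
    "the robot picks the apple",
    "the human opens the fridge",
]
prob = train_bigram(corpus)
p_robot_given_the = prob("robot", "the")  # "the" is followed by "robot" in 2 of 6 cases
```

In practice, unseen word pairs would receive zero probability under this estimate, which is why real N-gram models add smoothing or back-off.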

Unigram models are often employed for various language processing tasks, including information retrieval. A unigram model evaluates each word or term independently: its probability is calculated without any conditional context, using only the probability of the current word itself appearing.

Bidirectional models differ from unidirectional models in that they analyze text in both directions, backward and forward. This dual approach is commonly employed in various machine learning models and speech generation applications. Bidirectional models harness contextual information from both directions, providing a deeper understanding of the text.

Exponential models employ an equation that combines feature functions and n-grams to evaluate text. Unlike n-grams, this type of model allows for more flexibility in analyzing parameters and does not mandate the specification of individual gram sizes. Essentially, exponential models define features and parameters based on the desired outcomes, providing a more open-ended approach to text analysis.
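A common instance of an exponential model is the maximum entropy (log-linear) language model. As a hedged illustration, it can be written as follows, where the history $h$, the feature functions $f_i$, the weights $\lambda_i$, and the normalizer $Z(h)$ are notation introduced here rather than taken from the survey:

```latex
P(w \mid h) = \frac{\exp\left(\sum_{i} \lambda_i f_i(h, w)\right)}{Z(h)},
\qquad
Z(h) = \sum_{w'} \exp\left(\sum_{i} \lambda_i f_i(h, w')\right)
```

Each feature $f_i(h, w)$ can encode an n-gram of any length or any other property of the context, which is why no single gram size has to be fixed in advance.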

Neural language models, including recurrent neural networks (RNNs) and transformers, have gained popularity in recent years. These models use deep learning techniques to capture complex language patterns and dependencies.

The development of the Transformer architecture revolutionized language modeling. Transformers use self-attention mechanisms to capture relationships between words in a sentence. This is currently the most popular architecture.
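The self-attention mechanism mentioned above can be sketched as scaled dot-product attention. The example below is a simplified, pure-Python illustration: it omits the learned query, key, and value projections and the multiple heads used in real Transformers, and the toy token vectors are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Three toy token vectors; queries, keys, and values are the tokens themselves
# because the learned projections are left out of this sketch.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one and indicates how strongly a token attends to every other token in the sequence.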

1.2 Development of LLMs

Some well-known developments in LLMs are described below in detail.

ELIZA. The concept of language generation models originated in the 1960s with the development of ELIZA, the world’s first chatbot, by MIT researcher Joseph Weizenbaum. ELIZA’s creation laid the groundwork for natural language processing (NLP) research, paving the way for subsequent advancements in this field.

LSTM. The year 1997 witnessed the emergence of Long Short-Term Memory (LSTM) networks, introducing a significant advancement in neural network architecture. The introduction of LSTM networks enabled the development of deeper and more intricate neural networks capable of effectively processing vast amounts of data.

Stanford CoreNLP. In 2010, Stanford’s CoreNLP suite brought about a significant milestone in the field by offering developers a versatile toolkit. This suite empowers developers to conduct various natural language processing tasks.

Google Brain. In 2011, a scaled-down version of Google Brain surfaced, introducing groundbreaking features such as word embeddings. These advanced capabilities revolutionized natural language processing (NLP) systems by enhancing their ability to comprehend context with greater clarity.

Transformer models. Transformer models, introduced in 2017, brought significant advancements to language modeling. They employ self-attention mechanisms to capture global dependencies and have achieved state-of-the-art performance in various natural language processing tasks.

Large language model. OpenAI unveiled GPT-4, a language model whose scale has been reported, though not officially disclosed, as approximately one trillion parameters. This would represent a roughly five-fold increase compared to its predecessor, GPT-3, and a roughly 3,000-fold increase compared to the initial release of BERT. The introduction of GPT-4 sets a new benchmark in the field of language models, showcasing the remarkable progress in model size and capacity.

1.3 Popular LLMs

To date, many foundation models or LLMs have been developed. We present some selected models below, including GPT-3, GPT-4, BERT, T5, and LLaMA.

GPT-3 (Generative pre-trained transformer 3). Developed by OpenAI, GPT-3 is one of the most prominent language models. With 175 billion parameters, it can generate coherent and contextually relevant text across a wide range of domains.

GPT-4 (Generative pre-trained transformer 4). Unveiled on March 14, 2023, GPT-4 represents a significant advancement in language models, prioritizing factual accuracy and enhancing reliability compared to its predecessors, GPT-3 and GPT-3.5. Notably, GPT-4 introduces multimodal capabilities, enabling it to process images as input and generate comprehensive descriptions, classifications, and analyses across different modalities. This multimodal functionality expands the model’s versatility and enhances its ability to understand and generate content across various media formats.

BERT (Bidirectional encoder representations from transformers). Developed by Google, BERT popularized the paradigm of pre-training followed by fine-tuning for language understanding tasks. It has achieved remarkable results in tasks such as question answering and text classification.

T5 (Text-to-Text transfer transformer). Developed by Google, T5 is a versatile language model that can be fine-tuned for various natural language processing tasks, including summarization, translation, and text generation.

LLaMA. Developed by Meta AI, LLaMA is a family of pre-trained and fine-tuned generative text models with parameter counts ranging from 7 to 70 billion. LLaMA removes the absolute position embedding and instead adds rotary position embedding (RoPE) at each layer of the network.

2 Benefits of LLM for Robotics

The advent of LLM-based robots has brought about a plethora of innovative changes to the field. Here, we explore the various benefits that LLMs bring to robots. The necessity and significance of LLMs for robotics can be summarized in the following ten points:

Natural language interaction. LLMs provide robots with the ability to engage in natural language interactions, allowing users to communicate with robots in an intuitive and convenient manner. This interaction method aligns better with human habits and needs, enhancing the usability and acceptance of robots.

Task execution. LLMs assist robots in performing various tasks by understanding and generating natural language instructions. Robots can navigate, manipulate objects, and execute specific actions based on user language commands. This opens up broader possibilities for robot applications in everyday life.

Knowledge acquisition and reasoning. LLMs possess powerful information retrieval and reasoning capabilities, which can help robots acquire and process rich knowledge. Robots can interact with language models to obtain real-time and accurate information, thereby improving their decision-making ability and intelligence.

Flexibility and adaptability. The flexibility of LLMs enables robots to adapt to different tasks and environments. Through interaction with language models, robots can make flexible adjustments and self-adaptation based on specific circumstances, better meeting user needs.

Learning and improvement. LLMs enable continuous learning and improvement through interaction with users. By analyzing and understanding user feedback, robots can enhance their performance and proficiency. This learning and improvement capability allows robots to gradually adapt to user personalities and preferences, providing more personalized services.

Multimodal interaction. LLMs also support multimodal interaction, enabling robots to process different forms of inputs such as speech, images, and text simultaneously. This multimodal capability allows robots to comprehensively understand user needs and provide richer interaction experiences.

Education and entertainment. LLMs offer potential applications for education and entertainment purposes in robotics. Robots can provide educational content, answer questions, or engage in games and entertainment activities through interaction with language models. This has significant implications for children’s education, language learning, and the entertainment industry.

Emotional interaction. The application of LLMs enhances the emotional interaction capabilities of robots. By generating emotionally responsive outputs, robots can establish closer and more meaningful relationships with users. This emotional interaction is valuable in fields such as care robots, emotional support, and psychotherapy.

Collaboration and cooperation. LLMs enable robots to collaborate and cooperate better with humans. Robots can jointly solve problems, formulate plans, and execute tasks through interaction with language models. This collaboration and cooperation ability is significant for industrial automation, team collaboration, and human-robot coexistence.

Innovation and exploration. The application of LLMs stimulates innovation and exploration in the field of robotics. Through interaction with language models, robots can possess higher-level intelligence and comprehension abilities, opening up new avenues for research and development in robotics.

3 Robotics Based on LLMs

In this subsection, we introduce smart robotics systems based on LLMs that have appeared in recent years. In these systems, the LLM serves as the brain of the robot. First, we summarize the models of recent years in Table 2.

3.1 PaLM-SayCan

With the increasing popularity of LLMs, people have begun to wonder whether these models can be used to assist robots in performing various daily tasks. However, there are challenges in enabling robots to extract knowledge from LLMs and interact with the physical world. LLMs contain valuable semantic information about the real world, aiding robots in understanding natural language. Nonetheless, giving LLMs a physical form capable of interacting and making real-world decisions is challenging due to their lack of experience with physical objects and environments. PaLM-SayCan can function as the physical embodiment of an LLM, utilizing the LLM’s semantic capabilities to process natural language instructions and using a value function to let robots execute tasks assigned by humans. PaLM-SayCan relies on pre-trained meta-actions driven by visuomotor control, with BC-Z and MT-Opt employed to learn language-conditioned behavioral cloning (BC) and reinforcement learning (RL) policies, respectively. The LLM decomposes a received natural language instruction into smaller, manageable tasks, which can then be executed flexibly according to the robot’s current status, capabilities, and surrounding environment. To determine the feasibility of an action, PaLM-SayCan relies on a logarithmic estimate of the value function and affordance function, and it performs the action most likely to succeed in the current environment and state. For instance, upon receiving the instruction “Can you help me get an apple?”, the LLM may decompose it into several tasks: “walking to the kitchen, opening the refrigerator, obtaining the apple, and delivering it to the requester,” as shown in Figure 2(a).
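The way an LLM's preference and a value-function estimate of feasibility can be combined to choose the next skill may be sketched as follows. This is a minimal illustration under stated assumptions: the skill names, the probabilities, and the function `select_skill` are hypothetical, and the score is simply the sum of the two log terms rather than the exact formulation used by PaLM-SayCan.

```python
import math


def select_skill(llm_log_probs, affordance_values):
    """Pick the skill maximizing log p_LLM(skill) + log value(skill in current state)."""
    best_skill, best_score = None, -math.inf
    for skill, log_p in llm_log_probs.items():
        value = affordance_values.get(skill, 0.0)
        if value <= 0.0:
            continue  # skill judged infeasible in the current state
        score = log_p + math.log(value)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill


# Hypothetical LLM scores: how useful each skill is for "get me an apple".
llm_log_probs = {
    "go to the kitchen": math.log(0.30),
    "open the refrigerator": math.log(0.50),
    "pick up the apple": math.log(0.20),
}
# Hypothetical value-function outputs: success likelihood from the current state
# (the robot is still far from the kitchen).
affordance_values = {
    "go to the kitchen": 0.90,
    "open the refrigerator": 0.10,
    "pick up the apple": 0.05,
}
next_skill = select_skill(llm_log_probs, affordance_values)
```

Although the language model alone favors opening the refrigerator, the low feasibility of that skill from the current state shifts the choice to walking to the kitchen first.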

3.2 PaLM-E

While LLMs have demonstrated remarkable capabilities in handling complex tasks, integrating them as an interface into robots remains a significant challenge. A major limitation of LLMs is their reliance on text input, which is insufficient for robots that require physical interaction. PaLM-E is an LLM capable of integrating continuous sensory information from the real world, effectively bridging the gap between language and perception. Its multi-modal input encompasses vision, text, and state estimation, as shown in Figure 2(b) and exemplified by the question "What is in <img_1>?" The model is trained end-to-end and achieves state-of-the-art performance on OK-VQA. PaLM-E is a visual-language generalist that treats images and text as multi-modal inputs represented by latent vectors. It is a decoder-only model that autoregressively generates text completions when provided with a prefix or prompt. The output of PaLM-E falls into two categories: when tackling text generation tasks (such as embodied question answering or scene description), the model directly produces the final output (i.e., output text or speech). In contrast, when utilized for specific planning and control tasks, PaLM-E generates low-level instruction text (e.g., instructions for controlling robot meta-actions).

3.3 LM-Nav

Goal-based robot navigation can leverage large, unlabeled datasets for training, resulting in strong generalization capabilities in real-world scenarios. However, in vision-based settings, specifying targets often requires images. Current supervised learning methods are not only expensive but also demand linguistically described and labeled trajectory data, making them impractical for widespread use. How can users communicate with robots more conveniently? To address this challenge, LM-Nav was developed, exploiting the advantages of language to facilitate effective communication between users and robots. The LM-Nav system comprises three components: a vision-navigation model (VNM), a visual-language model (VLM), and a large language model (LLM). Notably, LM-Nav operates without the requirement of labeled data or fine-tuning. By combining these three pre-trained models, LM-Nav can extract landmark names from commands and navigate to the specified locations in pre-explored environments. First, it employs ViNG as the VNM to create a topological map using observations from a prior exploration of the environment. Subsequently, GPT-3 serves as the LLM, processing free-form text instructions to determine the target landmarks. Finally, CLIP serves as the VLM to locate the corresponding positions in the topological map based on the identified landmarks. By combining these models, LM-Nav can effectively follow natural language instructions to complete navigation tasks.
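The three-stage pipeline described above can be sketched in simplified form. In this illustration, the landmark list stands in for the LLM output, the `similarity` table stands in for VLM image-text scores, and the graph stands in for the topological map built by the VNM; all names and numbers are hypothetical. The greedy landmark matching used here is a simplification chosen for brevity and is not claimed to be the search procedure used by LM-Nav.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted topological graph."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def plan_route(graph, start, landmarks, similarity):
    """Visit, in order, the node whose image best matches each landmark."""
    route, current = [start], start
    for landmark in landmarks:
        # Greedy grounding: pick the node with the highest image-text score.
        target = max(graph, key=lambda node: similarity[node].get(landmark, 0.0))
        leg = shortest_path(graph, current, target)
        if leg is None:
            return None
        route.extend(leg[1:])
        current = target
    return route


# Hypothetical topological map (node -> neighbors) from prior exploration.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
# Hypothetical image-text similarity scores per node and landmark.
similarity = {
    "A": {"stop sign": 0.1, "blue dumpster": 0.0},
    "B": {"stop sign": 0.8, "blue dumpster": 0.1},
    "C": {"stop sign": 0.2, "blue dumpster": 0.2},
    "D": {"stop sign": 0.1, "blue dumpster": 0.9},
}
# Landmarks as they might be extracted from "go past the stop sign to the blue dumpster".
route = plan_route(graph, "A", ["stop sign", "blue dumpster"], similarity)
```

The resulting route is a sequence of map nodes that a low-level navigation policy would then traverse one hop at a time.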

3.4 Expedition A1

Expedition A1 (https://www.agibot.com), developed by AGIBot, embodies the company’s commitment to seamlessly integrating advanced AI into robotics and fostering harmonious collaboration between humans and machines. Envisioning a future where robots serve as indispensable assistants to humans, AGIBot’s mission is to create intelligent and versatile robots capable of unlocking limitless productivity. The company’s founding ethos is centered around the belief that "intelligent robots can create unlimited productivity" when designed to parallel human flexibility and intelligence. Expedition A1 is a humanoid robot equipped with reflex knee joints, designed to resemble a human form. This design choice stems from the fact that most work environments are currently tailored for human functionality, so humanoid robots can integrate and function in them without requiring significant environmental modifications. A key advantage of humanoid robots is their strong generalization capabilities, enabling them to adapt to diverse situations. Although the Expedition A1 can swap out components, such as replacing legs with tires, mimicking human movement and perception remains a significant challenge for robots. Expedition A1 integrates cutting-edge perception, control, and decision-making technologies, incorporating both a state-of-the-art language model and an independently developed visual model. Designed with industrial manufacturing in mind, it boasts 49 degrees of freedom, surpassing the limitations of traditional robots with only 20 degrees of freedom. Its high degree of freedom enables it to meet various industrial manufacturing requirements. The Expedition A1 is also modular, allowing for autonomous component replacement. For instance, PowerFlow is a joint motor for enhanced flexibility, while SkillHand features vision-based fingertip sensors designed for precision manufacturing scenarios.
In addition to its robust hardware, the Expedition A1 utilizes an LLM as its brain, complemented by EI-Brain’s embodied intelligence framework. This framework divides the robot’s system into different levels of management, including Expedition A1’s super brain in the cloud, local brain, cerebellum, and brainstem, each corresponding to a different task level.

4 New Transformer Architecture for Robotics

In this part, we introduce Transformer architectures designed for robotics and summarize those proposed in recent years in Table 3.

4.1 Control Transformer

Reinforcement learning methods struggle to tackle long-horizon tasks such as navigation effectively. Sample-based path planning techniques, by contrast, can discover collision-free paths in a known environment without the need for learning. Control Transformer (CT) utilizes a sample-based probabilistic road map (PRM) planner to generate conditional sequences from low-level policies, enabling it to complete navigation tasks solely through local information. Experiments show that CT is effective in complex terrain and unknown environments. By leveraging local observations, CT can solve long-horizon robot navigation tasks. Following training, CT can obtain a policy and complete navigation in partially observed or unknown environments. CT is a Transformer framework designed to model conditional sequences generated by robot actions. It utilizes a learnable value function to assess the initial cost of reaching the target position and to guide the sequence modeling and generation process of the Transformer. To facilitate learning from data collected under the guidance of sampling-based planning, CT treats the problem as a goal-conditioned sequence modeling problem. In essence, CT auto-regressively predicts the actions within a sequence.

4.2 Q-Transformer

Many proposed high-capability machine learning models rely on supervised learning, but their performance is limited by the quality of human demonstrations. Neither the full potential of the hardware nor the required experience can be obtained automatically (given the availability of unlabeled datasets). Reinforcement learning can address these limitations, but training Transformer-based models using reinforcement learning has proven challenging at large dataset sizes. To integrate reinforcement learning with the Transformer, Q-Transformer is proposed. It combines the Transformer structure with offline reinforcement learning, enabling the exploitation of Q-values for each action dimension. This is achieved by utilizing a Transformer-based architecture that leverages offline reinforcement learning to extend the representation of the Q-function through offline temporal difference backups. The approach involves discretizing each action dimension and representing each action dimension as a separate token with its own Q-values. This allows for the utilization of large and diverse robot datasets, enhancing the efficiency and effectiveness of the reinforcement learning process.
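The per-dimension discretization described above can be illustrated with a short sketch. Each action dimension is split into bins, and the bins are chosen one dimension at a time by taking the highest Q-value, conditioned on the bins already chosen. The function `toy_q_function` is a hypothetical stand-in for the Transformer, and the bin count and action range are assumptions made for illustration.

```python
def discretize(value, low, high, num_bins):
    """Map a continuous value in [low, high] to a bin index."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)


def bin_center(index, low, high, num_bins):
    """Return the continuous value at the center of a bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def greedy_action(q_function, state, num_dims, low, high, num_bins):
    """Choose one bin per action dimension, each conditioned on earlier choices."""
    chosen_bins = []
    for _ in range(num_dims):
        q_values = q_function(state, tuple(chosen_bins))  # one Q-value per bin
        best = max(range(num_bins), key=lambda i: q_values[i])
        chosen_bins.append(best)
    return [bin_center(b, low, high, num_bins) for b in chosen_bins]


def toy_q_function(state, previous_bins):
    # Hypothetical stand-in for the learned model: prefers bin 3 for the first
    # dimension, then the bin just below the previously chosen one.
    target = 3 if not previous_bins else max(previous_bins[-1] - 1, 0)
    return [-abs(i - target) for i in range(4)]


action = greedy_action(
    toy_q_function, state=None, num_dims=2, low=-1.0, high=1.0, num_bins=4
)
```

Treating each dimension as its own token keeps the number of Q-values per step equal to the number of bins, instead of growing exponentially with the number of action dimensions.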

4.3 Robotics Transformer

Robotics transformer 1. By transferring knowledge from large and diverse datasets, machine learning models can now be adapted to downstream tasks in a zero-shot or few-shot manner or through fine-tuning, which has significantly improved performance in many areas (such as computer vision, natural language processing, or speech recognition). However, the field of robotics has yet to show similar generalization capabilities. Training a general robotics model through open-ended task-agnostic training and incorporating high-performance architectures that can absorb large and diverse datasets may be a promising approach. If a model could act like a sponge, absorbing ubiquitous patterns of language and perception, it may be able to perform better on specific downstream tasks. The question remains whether it is possible to train a model in the field of robotics that can absorb knowledge from other fields. Could the model demonstrate zero-shot generalization capabilities for new tasks? Robotics Transformer 1 (RT-1) was proposed to address these questions. RT-1 is capable of encoding high-dimensional input and output data, including images and instructions, into compact tokens that can be efficiently processed by a Transformer. It operates in real time, making it suitable for applications that require rapid processing and response times. In experimental evaluations, RT-1 demonstrated strong generalization. The structure of RT-1 is composed of a FiLM-conditioned EfficientNet, a TokenLearner, and a Transformer. However, RT-1 is not an end-to-end model.

Robotics transformer 2. Can we pre-train a vision-language model (VLM) that can be seamlessly integrated into low-level robot control, thereby enhancing the generalization capabilities of the resulting policy? This can be achieved by representing the robot’s trajectories as sequences of tokens, effectively mapping natural language instructions into a series of robot actions. To create an end-to-end model that can directly map robot observations into actions, DeepMind employs a co-fine-tuning approach. Robotics Transformer 2 (RT-2) fine-tunes state-of-the-art VLMs on both web-scale visual-language tasks and robot trajectory data. Because RT-2 is trained on web-scale data, it directly acquires generalization ability and semantic awareness for new tasks. Through fine-tuning, the VLM is adapted to generate actions encoded as text. Specifically, the model is trained on a dataset that incorporates action-related text tokens. This type of model can be called a visual-language-action model (VLA). RT-2 builds upon the policy trained by Robotics Transformer 1 (RT-1), leveraging the same dataset and an expanded VLA to significantly enhance the model’s generalization capabilities for new tasks.
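Representing robot actions as text tokens can be sketched as follows. Each continuous action dimension is discretized into a fixed number of bins, and the bin indices are written out as a string that a language model could emit. The bin count of 256, the action range of [-1, 1], and the function names are assumptions made for this illustration and are not taken from the text above.

```python
def action_to_tokens(action, low=-1.0, high=1.0, num_bins=256):
    """Discretize each action dimension and render the bin indices as text."""
    bins = []
    for value in action:
        clipped = min(max(value, low), high)
        # Round to the nearest of num_bins evenly spaced levels.
        index = int((clipped - low) / (high - low) * (num_bins - 1) + 0.5)
        bins.append(index)
    return " ".join(str(b) for b in bins)


def tokens_to_action(text, low=-1.0, high=1.0, num_bins=256):
    """Invert the mapping: parse bin indices and return the bin values."""
    return [low + int(tok) / (num_bins - 1) * (high - low) for tok in text.split()]


# A hypothetical 3-dimensional action, e.g. a normalized end-effector displacement.
tokens = action_to_tokens([-1.0, 0.0, 1.0])
recovered = tokens_to_action(tokens)
```

Decoding the generated string back into continuous values introduces a quantization error of at most half a bin width per dimension.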

Robotics transformer X. In robot learning, it is common to train a separate large model for each application or environment. However, this approach can be limiting, as it may not allow for adaptability across different robots or environments. Can we develop a robot policy that is versatile and can be applied across various robots and environments? With the advancements in large models, it is within the realm of possibility to train a versatile model that exhibits strong generalization capabilities for a specific task. Inspired by these large models, X-embodiment training (i.e., training a robot policy on the datasets of the Open X-Embodiment repository) is proposed, which involves using robot data from diverse platforms for training. This approach enables the model to better adapt to changes in both the robot and the environment, leading to improved performance and versatility. Robotics Transformer X (RT-X) is categorized into two branches: RT-1-X and RT-2-X. RT-1-X employs the RT-1 architecture and is trained on the Open X-Embodiment repository, a dataset collected from different robot platforms (https://robotics-transformer-x.github.io), while RT-2-X leverages the policy architecture of RT-2 and is trained on the same dataset. Experiments demonstrate that both RT-1-X and RT-2-X exhibit enhanced capabilities. This suggests that robots, much like humans, may benefit from acquiring knowledge across various domains.

Related Technologies

In this section, we introduce the related technologies used in robotics. Note that agents, embodied AI, and robotics based on LLMs are used interchangeably in this paper. Here we divide the model of robotics into four parts: perception, decision-making, control, and interaction. We provide a more detailed introduction to the decision-making component, as it serves as the core of robotics based on LLMs and as the connecting link between perception and control. We summarize the related technologies introduced below in Figure 3.

Perception is a fundamental capability of robots, akin to their input. Currently, multi-modality is a popular approach for robot perception. The models discussed below employ different treatments of perception.

Berkeley Autonomous Driving Ground Robot (BADGR) is a mobile robot navigation system that uses end-to-end learning on self-supervised, off-policy data collected in real-world environments, training without any simulation or human supervision. This approach enables BADGR to navigate complex environments efficiently and paves the way for future advances in autonomous driving. ViNG is a goal-conditioned model inspired by goal-conditioned reinforcement learning. It predicts the temporal distance between image pairs and the corresponding actions to perform. By integrating learned policies with topological maps constructed from previously observed data, ViNG can determine how to reach visually indicated goals, even under variable appearance and lighting conditions. RECON is a robot learning system designed for autonomous exploration and navigation in complex and unpredictable real-world surroundings. Its core is a latent variable model of distance and action, combined with a non-parametric topological memory, which enables efficient exploration. ViKiNG, built upon RECON’s mapping, incorporates geographical hints to propose an integrated learning and planning method that uses auxiliary information. This method combines a local traversability model, which takes the robot’s current camera observation and a potential sub-goal and infers how difficult that sub-goal is to reach, with a heuristic model that examines hints in the cost graph and evaluates how suitable these sub-goals are for achieving the overall goal. The general navigation model (GNM) aims to train a general goal-conditioned model for vision-based navigation that generalizes broadly across diverse environments and embodiments, using data from multiple structurally similar robots. By developing pre-trained navigation models with such capabilities, GNM represents a significant step toward navigation models that can be applied to new types of robots.
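The combination of a learned temporal-distance predictor with a topological map, as described for ViNG, can be illustrated with a small sketch. It assumes that the learned model has already produced edge weights (the node names and step counts below are invented for illustration), so that planning reduces to a shortest-path search over the graph. This is not the authors’ implementation.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra search over a topological map whose edge weights are
    predicted temporal distances (time steps between observations)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, steps in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + steps, neighbor, path + [neighbor]))
    return float("inf"), []  # goal is not reachable from start

# Hypothetical map: nodes are past observations, weights are predicted time steps.
topological_map = {
    "start": {"hallway": 4.0, "lawn": 9.0},
    "hallway": {"door": 3.0},
    "lawn": {"door": 1.0},
    "door": {"goal": 2.0},
}
cost, path = shortest_path(topological_map, "start", "goal")
```

In a real system the weights would come from the learned distance model, and the first edge of the returned path would be handed to the low-level policy as the next sub-goal.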

1.2 Vision-language model

In recent years, large language models and vision models have each achieved great success in their own fields. However, each can only process input from its own modality (for example, a language model accepts only text, and a vision model accepts only images), which limits the tasks it can handle. Researchers therefore began to focus on multi-modal input by combining large language models with vision models. The result is the vision-language model (VLM), a multi-modal model that takes both vision and natural language as input and can process images and text at the same time. In practice, we also need to distinguish between recognizing 2D scenes (e.g., with Vision Transformers (ViTs)) and 3D scenes (e.g., with OSRT) when processing vision. VLMs come in various types, and many new ones are emerging. Contrastive Language-Image Pre-training (CLIP) is a neural network trained on diverse pairs of images and text. Given an image, it can be instructed in natural language to predict the most relevant text snippet without being directly optimized for that task. This zero-shot capability is similar to that of GPT-2 and GPT-3. LM-Nav also uses CLIP as its VLM: landmarks are extracted from the natural language instruction and associated with images in the topological map. VLMs are versatile and can be employed in various downstream tasks, including visual question answering (VQA), optical character recognition (OCR), and image captioning. For example, PaLM-E treats text and images as latent vectors of a multi-modal input, and Frozen processes its input in a similar way.
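To make the zero-shot matching step concrete, the following sketch scores one image embedding against several candidate text embeddings using cosine similarity followed by a softmax. The three-dimensional vectors, the captions, and the temperature value are assumptions chosen for illustration; a real CLIP model would produce the embeddings with its image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_scores(image_vec, text_vecs, temperature=0.07):
    """Softmax over temperature-scaled similarities between one image
    and each candidate text. The temperature is an assumed value."""
    logits = {label: cosine(image_vec, vec) / temperature
              for label, vec in text_vecs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Hypothetical embeddings standing in for encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a stop sign": [1.0, 0.0, 0.1],
    "a photo of a picnic bench": [0.0, 1.0, 0.3],
}
probs = zero_shot_scores(image_embedding, text_embeddings)
best = max(probs, key=probs.get)
```

The same scoring pattern applies to landmark grounding: each landmark phrase is a candidate text, and each map node contributes an image embedding.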

1.3 Vision-and-language navigation model

One of the primary objectives of AI research is to develop an embodied intelligence that can effectively communicate with humans and interact with the environment. Such an intelligence would understand human language and navigate its surroundings with ease, which has the potential to greatly benefit human society. However, achieving this goal is challenging: obstacles include insufficient datasets, navigation processing strategies, the processing of multi-modal inputs, and the transfer of models from familiar to unfamiliar environments. Despite these obstacles, the development of embodied intelligence remains a crucial area of AI research. Vision-and-language navigation (VLN) leverages visual observations to directly learn how to navigate and links images and actions across time. As an extension of visual navigation in both real and simulated environments, VLN can navigate complex 3D environments. Many datasets are available for VLN.

1.4 Vision-language-action model

Can we pre-train a model that integrates multimodal inputs and low-level robot policies to enhance the robot’s generalization and semantic reasoning abilities? DeepMind aimed to develop a straightforward end-to-end model that maps the robot’s observations directly to actions, thereby creating Vision-Language-Action (VLA) models. Prior approaches involved incorporating VLMs into robot policies or designing novel vision-language-action architectures. VLA models instantiated by fine-tuning were first introduced and implemented in RT-2, which leverages a large VLM. DeepMind takes a large-scale VLM pre-trained on a vast web-scale dataset and fine-tunes it, transforming the VLM into a VLA model. To unify robot actions and natural language responses, DeepMind represents actions as text tokens and integrates them directly into the training dataset, forming multimodal sentences. The model responds to a multimodal sentence, built from the robot’s observations and a command, by outputting the corresponding actions. This processing is analogous to how an LLM processes natural language data: action-related tokens are decoded and converted into robot actions at inference time. VLA models can significantly enhance the generalization capabilities of robots.
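The idea of treating actions as text tokens can be sketched as a simple discretization: each continuous action dimension is mapped to an integer bin, and the bin indices are written out as a string that a language model could emit. The action range, the bin count of 256, and the function names are assumptions made for this illustration, not details taken from the text above.

```python
def encode_action(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin and
    join the bin indices into a space-separated text string."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)  # keep the value inside the range
        index = round((clipped - low) / (high - low) * (bins - 1))
        tokens.append(str(index))
    return " ".join(tokens)

def decode_action(text, low=-1.0, high=1.0, bins=256):
    """Invert encode_action, up to quantization error."""
    return [low + int(token) / (bins - 1) * (high - low) for token in text.split()]

original = [0.0, -1.0, 1.0, 0.5]
text = encode_action(original)     # action as a token string
recovered = decode_action(text)    # approximate continuous action
```

The round trip loses at most half a bin width per dimension, which is the price of placing actions in the same discrete vocabulary as words.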

2 Decision-making

Decision-making is a fundamental capability of robots, enabling them to make informed decisions and plan tasks based on their current state and environment. As the core of a robot, decision-making links perception to control, analyzing input from the perception module to generate appropriate actions.

LLMs have the potential to significantly aid intelligent agents, and numerous studies have successfully used an LLM as the brain of an intelligent agent with promising results. Our ideal embodied intelligence is an intelligent entity that can perceive its surroundings and produce appropriate output after interacting with humans or the environment. The LLM plays a vital role in this process, serving as a central hub that analyzes multi-modal input and converts it into appropriate action output. The development of intelligent agents has progressed through several stages: symbolic agents relying on symbolic logic; reactive agents that prioritize environmental interaction and respond instantaneously; reinforcement learning-based agents trained to handle complex tasks but lacking generalization; agents that use transfer learning and meta-learning to generalize better across tasks; and the current LLM-based agents, in which an LLM serves as the agent’s brain. An LLM can interpret inputs, plan output actions, and demonstrate reasoning and decision-making abilities.

The emergence of ChatGPT has sparked a surge of interest in LLMs within the research community and industry in recent years. LLMs often serve as the brains of agents, and their zero-shot and few-shot generalization abilities enable them to adapt to various tasks without parameter updates. Their strong natural language understanding and generation capabilities underpin their reasoning and planning abilities. Additionally, LLMs can parse high-level abstract instructions to perform complex tasks without requiring step-by-step guidance (e.g., BabyAGI, https://github.com/yoheinakajima/babyagi), and their human-like text-generation capabilities make them highly effective communicators. Furthermore, LLMs can sense their environment, and technologies that expand their action space allow them to interact with the physical environment and complete tasks. Their reasoning and planning capabilities include logical and mathematical reasoning, task decomposition, and planning for specific tasks. LLM-based agents have been used in various real-world scenarios and have shown potential for multi-agent interaction and social capabilities. Overall, LLMs have revolutionized the field of artificial intelligence and hold great promise for future advancements.

2.2 Capacity of LLM in robotics

LLM serves as the brain of the robot, functioning as the central component that integrates knowledge, memory, and reasoning capabilities to enable the robot to plan and execute tasks intelligently.

Knowledge. The knowledge of an LLM for robotics can be categorized into two types: knowledge that must be acquired through learning (i.e., from the pre-training dataset) and knowledge that has already been learned and stored in memory.

Pre-training data. Various types of pre-training datasets are available, and the more extensive and richer the knowledge learned, the stronger the LLM’s generalization and natural language understanding capabilities will be. In theory, the more parameters a language model has, the more it can learn, which enables it to capture complex knowledge in natural language and gain powerful capabilities. Research has shown that a richer training dataset helps a language model answer diverse questions correctly. Datasets can be categorized by the type of knowledge they provide: basic semantic knowledge, which provides an understanding of language meaning; common sense, including everyday facts such as people eating when hungry or the sun rising in the east; and professional domain knowledge, which can help humans complete tasks such as programming and mathematics.

Memory. Just like human memory, embodied intelligence should be able to formulate strategies and make decisions for new tasks based on past experiences (i.e., observed actions, thoughts, etc.). When faced with complex tasks, a memory mechanism can help review past strategies to obtain more effective solutions. However, memory poses some challenges, such as the length of memory sequences and how to store and index memories efficiently as their number grows. As the robot’s memory burden increases over time, it must manage and retrieve memories effectively to avoid catastrophic forgetting.
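The storage and retrieval problem described above can be sketched with a bounded memory that evicts its oldest entries and ranks the rest by relevance to a query. The class name, the word-overlap relevance measure, and the example notes are all hypothetical; practical systems would more likely use learned embeddings and a vector index.

```python
from collections import deque

class EpisodicMemory:
    """Bounded memory: the oldest entries are evicted once capacity is
    reached, and retrieval ranks entries by word overlap with the query."""

    def __init__(self, capacity=100):
        self.entries = deque(maxlen=capacity)

    def add(self, text):
        self.entries.append(text)

    def retrieve(self, query, k=2):
        query_words = set(query.lower().split())
        scored = []
        # Iterate newest first so that age 0 is the most recent memory.
        for age, text in enumerate(reversed(self.entries)):
            overlap = len(query_words & set(text.lower().split()))
            if overlap:
                # Sort key: higher overlap first, then more recent first.
                scored.append((-overlap, age, text))
        return [text for _, _, text in sorted(scored)[:k]]

memory = EpisodicMemory(capacity=3)
for note in ["opened the fridge", "cup is on the table",
             "door was locked", "cup moved to the sink"]:
    memory.add(note)
hits = memory.retrieve("where is the cup")
```

The fixed capacity makes the trade-off explicit: the first note is forgotten as soon as the fourth arrives, which keeps retrieval cheap at the cost of losing old experience.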

Reasoning. Reasoning is a foundational element of human cognition, playing a crucial role in problem-solving, decision-making, and the analytical examination of information. It is equally important in enabling LLMs to solve complex tasks: reasoning capabilities allow LLMs to break problems down into smaller, manageable steps and solve them starting from the current state and known conditions. There is ongoing debate about how LLMs acquire their reasoning abilities, with some arguing that they result from pre-training or fine-tuning, while others believe that they emerge only at a certain scale. Research has shown that Chain-of-Thought (CoT) prompting can help LLMs reveal their reasoning capabilities, and some studies suggest that reasoning abilities may stem from the local statistical structure of the training data.

Planning. Humans plan when faced with complex challenges. Planning helps people organize their thoughts, set goals, and decide what to do in the current situation, so that they can gradually approach their goals. The core of planning is reasoning. An agent can use its reasoning capabilities to decompose high-level abstract instructions into executable subtasks and make a reasonable plan for each subtask. For example, LM-Nav uses GPT-3 to process the natural language instructions it receives. PaLM-E directly implements end-to-end processing, converting the received multi-modal input into multi-modal sentences for LLM processing. In the future, agents may also be able to update their task plans based on the current situation through multiple rounds of dialogue and self-questioning. Many studies have proposed methods for dividing a task into smaller executable subtasks during planning. Some directly break the task down into many subtasks and execute them sequentially. CoT processes only one subtask at a time and can complete the task adaptively, which gives it a degree of flexibility. There are also vertical planning methods that organize tasks into a tree structure.
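The relationship between tree-structured and sequential plans can be shown with a minimal sketch: a task tree is traversed depth-first, and its leaves form the ordered list of executable subtasks. The instruction and its decomposition are invented for illustration; in practice an LLM would generate the tree.

```python
def flatten_plan(node):
    """Depth-first traversal of a task tree, returning the leaf
    subtasks in the order they should be executed."""
    name, children = node
    if not children:
        return [name]  # a leaf is directly executable
    steps = []
    for child in children:
        steps.extend(flatten_plan(child))
    return steps

# Hypothetical decomposition of one high-level instruction.
plan_tree = ("bring me a drink", [
    ("find drink", [("go to kitchen", []), ("open fridge", [])]),
    ("deliver drink", [("pick up can", []), ("return to user", [])]),
])
steps = flatten_plan(plan_tree)
```

Keeping the tree rather than only the flat list lets an agent re-plan a single failed branch without discarding the rest of the plan.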

3 Control

Here, we argue that the control module is the key component responsible for regulating robotic actions. This module plays a crucial role in ensuring that the robot’s actions are executed accurately and successfully, with a focus on the hardware aspects of action execution.

Much previous work has focused on enabling robots and other agents to comprehend and execute natural language instructions. There are various approaches to learning language-conditioned behaviors, such as image-based behavioral cloning following the BC-Z method or reinforcement learning following the MT-Opt method. Imitation learning techniques train policies on demonstration datasets, while offline reinforcement learning has also been studied extensively. Some works suggest that imitation learning on demonstration data performs better than offline reinforcement learning, whereas other studies demonstrate the feasibility of offline reinforcement learning in theory and practice. Many works combine RL with Transformer architectures, and some integrate imitation learning with reward conditioning; Decision Transformer (DT), for example, combines imitation learning with elements of reinforcement learning. However, DT does not enable the model to perform better than the demonstration dataset it learns from. Deep Skill Graphs (DSG) present a novel approach to skill learning based on the options framework. This method uses graphs to represent discrete aspects of the environment, enabling the model to acquire structured knowledge and learn complex skills within the given domain. CT employs goal-conditioned RL to transform the local skill-learning problem into a goal-conditioned Markov decision process (MDP).
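One common way to condition an imitation-style sequence model on reward is to feed it the return-to-go at every step, that is, the sum of rewards from the current step to the end of the episode. The sketch below computes that quantity for a single episode; the reward values are made up, and the undiscounted sum is an assumption of this example.

```python
def returns_to_go(rewards):
    """Return-to-go at step t is the sum of rewards from step t
    to the end of the episode (undiscounted)."""
    result = []
    running = 0.0
    for reward in reversed(rewards):
        running += reward
        result.append(running)
    return result[::-1]

# Hypothetical episode with sparse rewards near the end.
rtg = returns_to_go([0.0, 0.0, 1.0, 2.0])
```

At test time the first return-to-go is set to a desired target return, and the model is asked to produce the actions that would be consistent with it.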

In the context of navigation robots, early approaches to augmenting navigation policies with natural language used statistical machine translation to discover patterns, which were then used to translate free-form instructions into a formal language defined by a grammar. However, these methods were limited to structured state spaces. Recent works have framed the VLN task as a sequence prediction problem. Additionally, some methods use nearly 1M labeled simulated trajectory demonstrations for training, but applying these models in unstructured environments remains a significant challenge. Data-driven approaches to vision-based mobile robot navigation often rely on realistic simulation or on collecting supervised data to directly learn goal-reaching policies from observations. Alternatively, self-supervised learning methods can use unlabeled datasets or trajectories generated automatically from onboard sensors and hindsight relabeling.
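Hindsight relabeling can be sketched in a few lines: any state that the robot actually reached later in a trajectory is treated, after the fact, as a goal it was trying to reach. The trajectory format, the field names, and the state labels below are assumptions made for this illustration.

```python
def hindsight_relabel(trajectory):
    """Turn one unlabeled trajectory into goal-conditioned training samples.

    Every later state is treated as a goal that was reached from an
    earlier state, together with the number of steps it took.
    """
    samples = []
    for i, (state, action) in enumerate(trajectory):
        for j in range(i + 1, len(trajectory)):
            goal_state = trajectory[j][0]
            samples.append({
                "state": state,
                "goal": goal_state,
                "action": action,      # first action taken on the way to the goal
                "distance": j - i,     # temporal distance in steps
            })
    return samples

# Hypothetical trajectory of (state, action) pairs; the last state has no action.
trajectory = [("s0", "forward"), ("s1", "left"), ("s2", "forward"), ("s3", None)]
samples = hindsight_relabel(trajectory)
```

A trajectory of four states yields six labeled samples without any human annotation, which is what makes the approach attractive for data collected by onboard sensors.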

3.2 How to execute actions after parsing natural language

To determine whether a skill can be executed in the current state after parsing a natural language command, a temporal-difference (TD) reinforcement learning approach can be employed. This method learns a value function that evaluates whether the skill is executable. The value function is derived from the corresponding affordance function in reinforcement learning. Additionally, LM-Nav uses a self-supervised learning method, leveraging a pre-trained VLM, to improve the parsing of free-form language instructions across a large number of previously seen environments. To address the challenges of long-horizon tasks, hierarchical reinforcement learning (HRL) can be employed, where higher-level policies set objectives for lower-level policies to execute. The process of mapping natural language and observations to robot actions can also be viewed as a sequence modeling problem. Transformer-based robot control, such as the Behavior Transformer, focuses on learning from demonstrations that correspond to each task. Gato suggests training a single model on large datasets that include both robotic and non-robotic data.
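A tabular sketch can illustrate how a TD-learned value estimate might act as an executability score for a skill. The `td_update` function is a standard TD(0) step; the `choose_skill` rule, which multiplies a language-model relevance score by the learned value, is an illustrative assumption rather than a method stated above, and all state names, skills, and numbers are hypothetical.

```python
def td_update(values, state, reward, next_state, done, alpha=0.5, gamma=0.9):
    """One TD(0) step: move V(state) toward reward + gamma * V(next_state)."""
    if done:
        target = reward
    else:
        target = reward + gamma * values.get(next_state, 0.0)
    current = values.get(state, 0.0)
    values[state] = current + alpha * (target - current)
    return values[state]

def choose_skill(llm_scores, affordances):
    """Pick the skill with the highest (language relevance) x (estimated
    executability). The product rule is an assumption of this sketch."""
    return max(llm_scores,
               key=lambda skill: llm_scores[skill] * affordances.get(skill, 0.0))

values = {}
# One successful episode: the skill succeeded from this state (reward 1).
td_update(values, "can_in_view", reward=1.0, next_state=None, done=True)

llm_scores = {"pick up the can": 0.6, "open the drawer": 0.4}
affordances = {"pick up the can": values["can_in_view"], "open the drawer": 0.1}
skill = choose_skill(llm_scores, affordances)
```

With sparse success rewards, the learned value approximates the probability that the skill succeeds from the current state, which is why it can double as an affordance estimate.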

4 Interaction

Interaction serves as a fundamental module that enables robots to engage and interact with both the environment and humans. To enhance robots’ ability to interact in the physical world, they are often trained extensively. While some researchers utilize artificial intelligence to interact in virtual environments, such as games or simulations, ultimately, these models must be transferred to the real world. However, the accuracy of these models tends to be lower in real-world settings compared to simulated environments.

Traditionally, game developers manually write character behaviors (including class methods and attributes) for a specific game environment; implementing a scenario such as a Valentine’s Day party would require more than a dozen of them. Almost all of these behaviors are fixed sets, which makes the process cumbersome and poorly scalable. In games, LLMs have been used to create interactive novels and text adventure games. LLMs are also increasingly used for planning robotic tasks because of their capacity to generate and decompose sequences of actions. Generative Agents (GA) are computer programs that mimic human behavior. The architecture extends the LLM by using natural language to store a complete record of the agent’s experiences. By synthesizing accumulated memories and reflecting on them at higher levels over time, the system can dynamically retrieve these memories to plan and guide its behavior. Agent characters communicate with one another in full natural language. They are aware of other agents in their vicinity, and the generative agent architecture determines whether they walk past or start a conversation. These agents can exhibit quite realistic individual behavior and social interactions. For example, when one of the agents is told that it wants to organize and host a Valentine’s Day party, the agents spontaneously invite others to attend, get to know each other, ask each other out on dates, and arrive at the party together on time. This architecture gives generative agents the ability to remember, retrieve, reflect, interact with other agents, and plan in ever-changing circumstances.

4.2 Language-based human-robot interaction

Human-robot interaction can take place through a GUI (Graphical User Interface) or an LUI (Language User Interface). A GUI is a graphically displayed user interface operated through an interactive device. Unlike a GUI, an LUI lets the user interact directly in natural language; the most representative LUI product is ChatGPT. Traditionally, modeling human-robot interaction in natural language has been difficult, either because rigid instruction formats constrain users or because intricate algorithms are needed to manage numerous probability distributions over actions and target objects. Translating instructions into commands that a real-world robot can understand is not easy, and fixed collections of desired actions and directives have traditionally been used for this purpose. However, this significantly limits the robot’s flexibility and generalizes poorly across hardware platforms. The LAnguage Trajectory TransformEr (LATTE) introduces a versatile language-driven framework that lets users customize and adapt a robot’s overall trajectory. The approach leverages pre-trained language models (e.g., BERT and CLIP) to encode the user’s intention and target objects directly from unrestricted text input and scene images. It combines these with geometric features produced by a network of transformer encoders and generates the trajectory with a transformer decoder, eliminating the need for prior task-related or robot-specific information.

Given the vagueness and ambiguity of natural language, robots should take more initiative in human-robot interaction in the future; that is, the robot should actively ask the user questions through the large language model. If the robot finds the user’s words problematic or is unsure what they mean, it should ask the user to clarify or confirm the intended meaning.
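One simple way to decide when a robot should ask a question is to compare the confidence of its top two interpretations of a command. The sketch below acts only when the best interpretation clearly beats the runner-up and otherwise returns a clarifying question. The margin threshold, the scores, and the object names are hypothetical; in practice the scores might come from an LLM or a grounding model.

```python
def respond(interpretations, margin=0.2):
    """Act on the top interpretation only if it beats the runner-up by
    at least `margin`; otherwise ask the user to disambiguate."""
    ranked = sorted(interpretations.items(), key=lambda item: item[1], reverse=True)
    best, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score - runner_up_score < margin:
        options = " or ".join(name for name, _ in ranked[:2])
        return ("ask", f"Did you mean: {options}?")
    return ("act", best)

# Hypothetical confidence scores for the command "bring me the cup".
ambiguous = respond({"red cup": 0.45, "blue cup": 0.40, "bowl": 0.15})
clear = respond({"red cup": 0.9, "blue cup": 0.1})
```

The margin controls the trade-off between interrupting the user too often and acting on a misunderstanding.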

Applications of LLMs in Robotics

Large models and robotics can be combined across various domains. Ten specific applications are listed below, along with their explanations:

Autonomous navigation and path planning. Large models provide powerful semantic understanding and reasoning capabilities for robots, assisting them in autonomous navigation and path planning in unknown environments. By combining large models with sensor data, robots can comprehend semantic information in the environment, recognize obstacles, target locations, and navigation objectives, and generate suitable path-planning solutions .

Speech interaction and NLP. LLMs excel in speech recognition, semantic understanding, and natural language generation. Robots can leverage large models for speech interaction, understanding and answering user queries, executing specific tasks, and providing personalized service experiences.

Visual perception and object recognition. Large models possess strong capabilities in image and video analysis, aiding robots in object recognition, target detection, and scene understanding. By integrating deep learning and large models, robots can achieve efficient and accurate visual perception, which can be applied in autonomous driving, robot vision-based navigation, and industrial automation.

Human-robot collaboration and social robots. Large models with natural language processing and emotion analysis help robots understand human feelings and intentions better, making interactions between humans and robots more natural and smart. Social robots can engage in conversations, comprehend emotions, and provide companionship and support, which are applied in fields like healthcare, education, and entertainment.

Humanoid robots and emotional expression. Large models can help humanoid robots better understand and express emotions. Through natural language generation and emotion recognition technologies, robots can engage in emotional communication and expression with humans, providing emotional support and companionship.

Industrial automation and robot control. Large models can be combined with sensor data for industrial process monitoring, anomaly detection, and predictive maintenance. By learning and analyzing large-scale data, robots can achieve intelligent industrial automation and adaptive control.

Healthcare and rehabilitation robots. Large models can be applied in medical and rehabilitation robots to assist in diagnosis, treatment, and patient care. Robots can analyze medical images, patient data, and clinical records, aiding in disease detection, surgical planning, and personalized therapy. They can also provide physical assistance and rehabilitation exercises for mobility-impaired patients.

Environmental monitoring and exploration. Large models can be combined with robot platforms for monitoring and exploration in various environments, such as oceans, forests, and disaster sites. These robots can analyze sensor data, satellite imagery, and other environmental data to monitor pollution levels, detect natural disasters, and explore uncharted territories.

Agriculture and farm mechanization. Large models and robots can be applied in agriculture and farm mechanization, optimizing crop management, monitoring plant health, and automating labor-intensive tasks. Robots equipped with sensors and cameras can collect data from farmlands, and analyze soil conditions, climate changes, and crop requirements, providing farmers with decision support to enhance agricultural productivity and sustainability.

Education and learning assistance. Large models and robots can provide personalized tutoring and learning support in the field of education. Robots can interact with students, and then offer personalized learning materials and guidance based on their abilities and needs . Leveraging the semantic understanding and knowledge reasoning capabilities of large models, robots can answer questions, explain concepts, and help students deepen their understanding of knowledge.

In summary, the combination of large models and robotics holds tremendous potential across various domains, including autonomous navigation, speech interaction, visual perception, human-robot collaboration, industrial automation, healthcare, environmental monitoring, agriculture, and education. It can bring convenience and innovation to human life and work.

Challenges

In the realm of Web 3.0, big data, AI-Generated Content (AIGC), and machine learning, collecting datasets has always been a challenge. Training LLMs currently requires vast amounts of data, particularly high-quality datasets that consume considerable resources to build. In robotics, collecting datasets is even more difficult. While an LLM like ChatGPT relies on text data for pre-training, a VLM uses a combination of text and image data. Robotics requires both, along with additional modalities such as touch, to serve as the robot’s sensory input. These diverse datasets need to be processed in a unified format, allowing the robot’s brain to plan and divide tasks effectively. Unfortunately, ready-made multi-modal datasets are lacking, and collecting them requires a significant time investment. Moreover, policy control is necessary, which involves interaction between the robot and its environment and therefore requires 3D data. The data required for robotics are diverse and scarce, with poor general applicability. For instance, a dataset used to train robot dogs cannot be applied to humanoid robots, and a dataset used for screwing on an assembly line may not be suitable for robots that assemble items. However, with the emergence of platforms such as Open X-Embodiment (a dataset repository spanning different robot platforms, https://robotics-transformer-x.github.io/), the challenges of dataset collection in robotics may be alleviated in the future.

2 Training Schemes

As embodied intelligence necessitates interaction with the physical environment, the model’s training requires specific scenarios and setups, e.g., distributed training. Current research trains robot-related models in various environments, such as games, simulations, and real-world scenarios. Training in game scenarios is straightforward, with simple operations like button-pressing. However, the knowledge gained from games may not translate well to real-world scenarios, as the information in complex scenes varies greatly, and language models cannot provide a universal solution. Simulation environments aim to replicate reality closely, with low energy consumption and cost; however, real scenes must first be modeled in the simulator. While game and simulation environments can train models, they share a common issue: poor transferability to real scenes. For instance, a model with 90% accuracy in a game or simulation may achieve only 10% accuracy in a real scene. Real-scene training faces significant challenges of its own, such as cost. In simulations, objects can be generated through code, but in reality, purchasing them can be expensive. Transferring models between different training scenarios is therefore a significant challenge.

3 Shape

Currently, most work environments in human society are well-suited for humanoid robots. However, the question arises whether robots must be human-shaped. Numerous types of robots exist, each with its own capabilities and applications, as shown in Figure 4(a). From an energy consumption perspective, wheels are more efficient than legs. For moving objects, therefore, a humanoid robot walking on legs may be a poorer choice than a conveyor belt. Similarly, a chef robot may not need to hold a spatula and cook like a human. In many cases, designing a pipeline tailored to the specific task can lead to more efficient automation than a humanoid robot. While humanoid robots are often depicted in animation such as Mobile Suit Gundam or games such as Armored Core, their design may not always be practical in real applications. For instance, a robot designed solely for washing dishes may not need the ability to sing. Modular designs such as Expedition A1 (https://www.agibot.com) can offer optimal results for different scenarios by replacing certain components. The shape of the robot remains a topic of debate, and the decision should ultimately focus on suitability for the task at hand.

4 LLM Deployment

Given embodied intelligence, the question arises of where to deploy its brain. Current technical limitations make it impractical to deploy a full LLM locally on the robot. The prevailing industry practice employs two brains: a cloud-based super brain and a local brain, as shown in Figure 4(b). However, no consensus on this device-plus-cloud deployment method has yet been established. A feasible solution could be a dynamic, compact model on the local client side that handles basic scenario interactions, while the cloud-based super brain tackles complex and challenging problems. This deployment structure also introduces latency, as information exchange between the robot and the super brain requires signal transmission. In certain environments, such as those with signal loss, the robot may be left with only its local brain, potentially leading to loss of control or unpredictable behavior. The LLM deployment architecture remains a pressing issue that must be addressed in the future development of agents.
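The two-brain arrangement can be sketched as a routing function that sends complex queries to the cloud model when a connection is available and falls back to the local model otherwise. Every name here (`answer`, `cloud_model`, `local_model`, `is_complex`) is a hypothetical stand-in, and the word-count complexity check is only a placeholder for a real policy.

```python
def answer(query, cloud_model, local_model, is_complex, connected=True):
    """Send complex queries to the cloud model when it is reachable;
    otherwise, or on failure, fall back to the local model."""
    if is_complex(query) and connected:
        try:
            return "cloud", cloud_model(query)
        except (TimeoutError, ConnectionError):
            pass  # signal lost mid-request: degrade to the on-device model
    return "local", local_model(query)

# Hypothetical stand-ins for the two models and the complexity check.
def cloud_model(query):
    raise TimeoutError("no signal")  # simulate a dropped connection

def local_model(query):
    return "stop and wait"

def is_complex(query):
    return len(query.split()) > 5

source, reply = answer("plan a three course dinner for six guests",
                       cloud_model, local_model, is_complex)
```

Making the fallback an explicit, conservative behavior is one way to keep the robot predictable when the cloud brain is unreachable.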

5 Security

LLMs like ChatGPT may harbor biases or misconceptions stemming from their pre-training data. These biases can manifest as problematic guidance for users, and robots that rely on LLMs as their brains may exhibit the same biases. Since a robot’s outputs are typically physical actions, biased or mistaken guidance can have harmful consequences for users, such as a chef robot burning down a house while cooking. Beyond physical safety risks, robots also raise concerns about data security. For instance, a robot butler that resides in a home becomes intimately familiar with the household’s environment and occasionally requires cloud interaction for certain tasks. During such interaction, there is a risk of private data leakage, which could be mitigated by operating offline, though this may compromise the robot’s performance.

6 Dialogue Consistency

Humans often don’t complete tasks in a single, static step. Instead, they iteratively adjust strategies and goals based on feedback received after taking action. The same is true for embodied intelligence. When faced with high-level, abstract, or ambiguous commands, robots may not be able to decompose them into executable small tasks at first. They need to obtain further feedback from the environment and humans through continuous dialogue to update their goals. Without this ability to engage in continuous dialogue, which enables robots to perform tasks dynamically, their performance will be significantly impaired . Moreover, the maximum length limit of a robot’s context is another issue worth considering. Typically, embodied intelligence may play a housekeeper role, handling daily tasks like washing dishes or drying clothes. However, for long-term tasks like scientific research, robots require more context-understanding capabilities. Currently, there’s a limit to the length of context that robots can handle, and this limitation can lead to catastrophic forgetting . Dialogue persistence is a crucial challenge for long-term tasks.
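The effect of a context-length limit can be illustrated with a sliding window that keeps only the most recent messages fitting a token budget. Counting tokens by whitespace splitting is a deliberate simplification, and the dialogue and budget are invented; a real system would use the model’s own tokenizer.

```python
def fit_context(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits the
    budget. Token counting by whitespace is a stand-in for a tokenizer."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return kept[::-1]  # restore chronological order

# Hypothetical dialogue history between a user and a household robot.
history = ["wash the dishes", "done", "now dry the clothes",
           "which clothes", "the ones in the basket"]
window = fit_context(history, max_tokens=10)
```

Everything that falls outside the window is invisible to the model, which is how earlier instructions can be lost during a long task unless they are summarized or stored in external memory.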

7 Social Influence

The rapid advancement of LLMs is bringing the era of embodied intelligence, as depicted in science fiction movies and games, closer to reality. This technological breakthrough will undoubtedly revolutionize human society and unleash unprecedented productivity. With robots capable of performing repetitive tasks, the need for human labor in various industries will diminish. However, this shift may also have far-reaching consequences, potentially disrupting social structures and stability. As robots replace low-end manual labor, it raises questions about the fate of those who previously held these jobs. The double-edged sword of embodied intelligence presents both liberation and disruption. While automation may usher in unprecedented efficiency, it also poses challenges for societal adaptation. Some works of science fiction, such as Detroit: Become Human, depict a future where robots gain consciousness and come into conflict with humans, leading to a war between the two. Alternatively, technology may fall into the wrong hands, becoming a tool for exploitation and solidifying class divisions. In a worst-case scenario, robots may even replace humans. As we embrace the development of embodied intelligence, we must also confront the ethical and societal implications it entails.

8 Ethics

Embodied intelligence has long been regarded as a mere tool, but it may hold more significance in the eyes of some users. For instance, companion robots can bring solace to lonely individuals, much like a loyal companion. Some people even develop emotional attachments to inanimate objects, such as their first car or a vehicle that has been with them for a long time. If we were to create robots that resemble humans or exhibit human-like intelligence, would they evoke different emotions? In science fiction movies, robots that gain self-awareness and break free from their programming often develop emotions and even marry humans. Interestingly, robots powered by LLMs have already demonstrated a degree of intelligence. Will they eventually become conscious? If embodied agents evolve to possess consciousness, should we still consider them tools? This raises questions about how conscious robots should be defined and whether they can be considered human. Although this challenge is still far off in the development of smart robots, it is an intriguing topic to ponder.

Promising Directions for Future Work

Security has always been a pressing concern for all kinds of models, particularly with regard to user privacy. However, we argue that the safety of agents during task execution is of paramount importance. Here, we consider whether an agent’s actions during task execution could cause harm. For instance, consider a scenario where a robot is asked to make lunch but, in the process, sets the kitchen on fire. In another scenario, imagine a robot tasked with killing fish that mistakenly identifies humans as fish and proceeds to chase and harm them. These scenarios highlight the need to limit the actions an agent can perform in order to prevent potential harm. Current robot systems focus on enabling the robot to determine which actions can be performed in the current state and environment, without fully considering the consequences of executing those actions. Therefore, we propose that the safety of task execution must be a top priority, guaranteeing that the robot’s actions do not harm human rights and interests.
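One simple way to picture "limiting the actions an agent can perform" is a check that runs between the planner and the controller. The sketch below is a hypothetical illustration, not a method from any cited system: the `Action` type, the `ALLOWED_ACTIONS` and `PROTECTED_TARGETS` tables, and the function names are all assumptions, and a real system would also need to reason about the consequences of each action rather than only its name and target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str    # what the robot intends to do, e.g. "cut"
    target: str  # what the perception module believes the object is


# Hypothetical policy tables; real deployments would need far richer rules.
ALLOWED_ACTIONS = {"pick", "place", "cut", "heat"}
PROTECTED_TARGETS = {"human", "pet"}


def is_safe(action: Action) -> bool:
    """Reject actions outside the allowlist or aimed at a protected target."""
    if action.name not in ALLOWED_ACTIONS:
        return False
    if action.target in PROTECTED_TARGETS:
        return False
    return True


def filter_plan(plan):
    """Split a planned action sequence into approved and rejected actions."""
    approved = [a for a in plan if is_safe(a)]
    rejected = [a for a in plan if not is_safe(a)]
    return approved, rejected


plan = [Action("cut", "fish"), Action("cut", "human"), Action("chase", "fish")]
approved, rejected = filter_plan(plan)
```

In this example only `Action("cut", "fish")` is approved. Note that such a filter still depends on correct perception: if a human is misidentified as a fish, as in the scenario above, a check on the target label alone would not catch the error.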

2 Training Scenario Transfer

Due to technical or economic constraints, it is common to train robot action policies in simulated or gaming environments. However, the ultimate goal of agent training is deployment in real-world scenarios. Unfortunately, an agent trained in one kind of scenario may be poorly adapted to another, which can compromise its performance when deployed in real-world situations. The fundamental source of this problem is the disparity between the feedback mechanisms of simulated and real-world environments. In games or simulations, feedback is often straightforward, with the robot receiving clear and concise information about the outcome of its actions. In contrast, real-world feedback is more complex and nuanced, making it difficult to assess the feasibility of a task from a limited set of scenarios. Therefore, a valuable research direction is to explore methods for transferring trained models across scenarios while maintaining their accuracy in the original training environments.

3 Unified Format of Modalities

Currently, many models use an LLM as the robot’s brain, and text is typically the input that an LLM accepts. However, for agents that rely on multi-modal perception, efficiently handling diverse input formats poses a significant challenge. To address this issue, a VLA model has been proposed, which uniformly converts visual and natural language inputs into multi-modal sentences for processing and outputs actions in the same format. In other words, multi-modal sentences are used to harmonize input and output. Nevertheless, there is currently no unified processing for other modalities such as touch and smell. It is anticipated that unified multi-modal models like VLA will gain popularity in the future.
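To illustrate what "outputs actions in the same format" can mean in practice, the sketch below writes a continuous action vector as a short string of integer tokens and reads it back. This is an assumed scheme for illustration only: the uniform binning, the range of [-1, 1], the 256 bins, and the function names `encode_action` and `decode_action` are our assumptions, not the implementation of the VLA model mentioned above.

```python
def encode_action(values, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin, written as text."""
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)  # keep the value inside the valid range
        index = round((clipped - low) / (high - low) * (bins - 1))
        tokens.append(str(index))
    return " ".join(tokens)


def decode_action(text, low=-1.0, high=1.0, bins=256):
    """Invert encode_action up to the quantization error of one bin."""
    return [low + int(tok) / (bins - 1) * (high - low) for tok in text.split()]


# Hypothetical 3-dimensional action; the last value is out of range and is clipped.
action = [0.3, -0.7, 2.0]
sentence = encode_action(action)
recovered = decode_action(sentence)
```

Because the action is now an ordinary token sequence, it can sit in the same output stream as natural language. The price is a quantization error of at most half a bin width per dimension, which shrinks as `bins` grows.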

4 Modular Components

As previously discussed, the field of robotics currently lacks a unified approach to robot design, and opinions on the matter vary. We believe that robots should follow a modular design, in which each part can be swapped out like a machine component, as in Figure 4(c), allowing for greater versatility and adaptability (see https://www.agibot.com). To achieve this, we must first establish unified specifications for the robot’s modules. For instance, a robot can be composed of a head, torso, upper limbs, and lower limbs, with the upper and lower limbs being interchangeable based on the task at hand. For cooking, the upper limbs could be fitted with a spatula; for clearing weeds in the yard, the lower limbs could be fitted with a weeder.
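A unified module specification of this kind can be pictured as a shared software interface that every interchangeable part implements. The following is a minimal sketch under that assumption; the `LimbModule` protocol, the slot names, and the `CookingArm`, `WeedingLeg`, and `Robot` classes are hypothetical names introduced here for illustration and do not correspond to any existing robot platform.

```python
from typing import Dict, Protocol


class LimbModule(Protocol):
    """Hypothetical unified specification that every swappable limb follows."""

    slot: str  # which mounting point the module occupies

    def perform(self, task: str) -> str: ...


class CookingArm:
    slot = "upper_limb"

    def perform(self, task: str) -> str:
        return f"cooking arm: {task}"


class WeedingLeg:
    slot = "lower_limb"

    def perform(self, task: str) -> str:
        return f"weeding leg: {task}"


class Robot:
    def __init__(self) -> None:
        self.modules: Dict[str, LimbModule] = {}

    def attach(self, module: LimbModule) -> None:
        # Attaching to an occupied slot replaces the previous module.
        self.modules[module.slot] = module

    def run(self, slot: str, task: str) -> str:
        return self.modules[slot].perform(task)


robot = Robot()
robot.attach(CookingArm())
robot.attach(WeedingLeg())
result = robot.run("upper_limb", "stir the soup")
```

Because the robot body only depends on the shared interface, a new limb can be introduced without changing `Robot`, which is the software counterpart of the interchangeable hardware described above.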

5 Autonomous Perception

Current research focuses on developing robots that can interact with humans through natural language instructions. In many cases, we study how humans issue instructions and how robots can decompose abstract tasks into specific sub-tasks for execution. However, we also hope that robots can perceive and respond autonomously to our immediate needs. For instance, if a cup falls to the ground and breaks, an agent should be able to perceive the situation through hearing and vision, and then autonomously clean up the glass fragments for us. Autonomous perception requires the robot to have common sense, a capability that can be provided by using an LLM as the robot’s brain. Research on robots’ autonomous perception capabilities is crucial for improving our quality of life in the future.

Conclusions

In this survey, we summarized the methods and technologies currently used for large models in robots. First, we reviewed the basic concepts of large language models and several common large models. We explained the improvements that using large models as brains brings to robots. We also introduced representative LLM-based robot models proposed in recent years, such as LM-Nav, PaLM-SayCan, and PaLM-E. Next, we divided the robot into four modules: perception, decision-making, control, and interaction. For each module, we discussed the relevant technologies and their functions: the perception module processes the robot’s input from its surroundings; the decision-making module understands human instructions and plans; the control module processes output actions; and the interaction module interacts with the environment. We also explored the potential application scenarios of current LLM-based robots and discussed the challenges, such as training, safety, shape, deployment, and long-term task performance. Finally, we considered the social and ethical implications of intelligent robots and their potential impact on human society.

As LLMs continue to evolve, robots may become increasingly intelligent and capable of processing instructions and tasks more efficiently. With advancements in hardware, robots may eventually become reliable assistants for humans, as depicted in science fiction movies. However, we must also be mindful of their potential impact on society and address any concerns proactively. Embodied intelligence is a new paradigm in intelligence science and is of great significance for its future development. LLM-based robotics represents a potential path to embodied intelligence. We hope this survey can provide some inspiration to the community and facilitate research in related fields.

Acknowledgment

This research was supported in part by the National Natural Science Foundation of China (Nos. 62002136 and 62272196), the Natural Science Foundation of Guangdong Province (No. 2022A1515011861), Engineering Research Center of Trustworthy AI, Ministry of Education (Jinan University), and Guangdong Key Laboratory of Data Security and Privacy Preserving.