The report notes that as technology has advanced, embodied intelligence has evolved from an early reliance on combinations of multiple independent "small-module" AI algorithms to an approach driven by large, unified frameworks, with significant gains in flexibility and adaptability. Initially, to accomplish a specific task, the system invoked different algorithm modules as needed, combined with human intervention, to achieve the goal. For example, in vision processing, object recognition algorithms identified objects; for control strategy, methods from traditional robotics such as reinforcement learning, imitation learning, and morphological computing were applied so that robots could make good decisions without human intervention. This phase of innovation responded mainly to growing demand for robot applications, aiming to add intelligent capabilities to robots and move beyond the traditional mode of fixed automation.
However, with the development of large-model technology, embodied intelligence has begun to integrate these functions into a unified architecture, using the latent knowledge understanding and expression capabilities of large models not only to enable natural-language communication but also to support seamless multimodal information processing and transformation. This allows the system to jointly process multiple sensory inputs, including language, vision, touch, and hearing, and to execute specific action instructions by fusing motion-experience data such as robot action trajectories. This shift marks an important leap from loosely integrated modules to unified intelligent solutions.
According to the report, the embodied intelligence technology system can be divided into four modules, "perception-decision-action-feedback", which form a closed loop: the system continuously interacts with the environment, reconstructs and maps it, makes decisions and acts autonomously, and keeps learning and evolving from experience feedback.
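The closed loop described above can be sketched in a few lines of code. This is an illustrative toy, not anything from the report: the `Agent` class, its method names, and the single "obstacle distance" reading are all assumptions chosen to make the perceive-decide-act-feedback cycle concrete.

```python
# Minimal sketch of the "perception-decision-action-feedback" closed loop.
# All names and numbers here are illustrative, not taken from the report.

class Agent:
    def __init__(self):
        self.experience = []          # accumulated feedback for later learning

    def perceive(self, env):
        # Perception: read a raw sensor value from the environment
        return {"distance": env["obstacle_distance"]}

    def decide(self, observation):
        # Decision: a simple rule -- stop if too close, otherwise advance
        return "stop" if observation["distance"] < 1.0 else "advance"

    def act(self, action, env):
        # Action: change the environment state
        if action == "advance":
            env["obstacle_distance"] -= 0.5
        return env

    def feedback(self, observation, action):
        # Feedback: record the outcome so future decisions can improve
        self.experience.append((observation, action))

env = {"obstacle_distance": 2.0}
agent = Agent()
for _ in range(4):                    # run the loop for a few cycles
    obs = agent.perceive(env)
    action = agent.decide(obs)
    env = agent.act(action, env)
    agent.feedback(obs, action)
```

In a real system each stage would be far richer (multimodal perception, learned policies, physical actuation), but the cycle structure is the same: each pass through the loop feeds the result of the last action back into the next perception and decision.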
Figure: Embodied intelligence technology system
Perception module: responsible for collecting and processing information, sensing and understanding the environment through a variety of sensors. Common sensors on robots include visible-light cameras, infrared cameras, depth cameras, lidar, ultrasonic sensors, pressure sensors, and microphones. Different sensors handle different sensing tasks, such as measuring distance, detecting obstacles, and receiving sound. After acquiring environmental information, the robot must interpret it through algorithms; for changeable or unfamiliar scenes, it needs multimodal large models to fuse and reason over environmental information such as sound, images, video, and positioning.
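One basic building block of perception is fusing redundant readings from different sensors. As a minimal sketch (the sensor names and noise figures are illustrative assumptions, and real fusion pipelines such as Kalman filters are far more involved), inverse-variance weighting combines two distance estimates so the more precise sensor dominates:

```python
# Inverse-variance weighted fusion of redundant sensor readings.
# Sensor values and variances below are made-up illustrative numbers.

def fuse(readings):
    """readings: list of (value, variance) pairs -> fused estimate."""
    weights = [1.0 / var for _, var in readings]
    weighted_sum = sum(w * value for w, (value, _) in zip(weights, readings))
    return weighted_sum / sum(weights)

lidar      = (2.05, 0.01)   # precise: low noise variance
ultrasonic = (2.30, 0.09)   # noisier: high noise variance
fused = fuse([lidar, ultrasonic])   # lands between the two, nearer the lidar
```

The fused estimate sits between the raw readings but much closer to the low-variance lidar, which is the intended behavior of this weighting rule.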
Decision-making module: the core of the entire embodied intelligent system. It receives environmental information from the perception module and performs task planning and reasoning to guide action generation by the action module. Early systems relied mainly on hand-coded rules and algorithm designs for specialized tasks; later, reinforcement learning methods based on proximal policy optimization (PPO) and Q-learning offered greater decision-making flexibility. The emergence of large models has greatly enhanced the intelligence of embodied agents: combining multimodal large models with world models can enable predictive perception, and in the future embodied intelligent systems will be able to integrate multiple sensory inputs, understand instructions more autonomously, and generalize better across tasks.
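To make the Q-learning method mentioned above concrete, here is a toy tabular example on a one-dimensional corridor: five states, a reward only at the goal, and random exploration (Q-learning is off-policy, so it can learn the greedy policy even while acting randomly). The environment, state count, and hyperparameters are all illustrative assumptions.

```python
import random

# Toy tabular Q-learning on a 1-D corridor: states 0..4, reward 1 at state 4.
# States, actions, and hyperparameters are illustrative, not from the report.

random.seed(0)
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                        # step left / step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma = 0.5, 0.9                   # learning rate, discount factor

for _ in range(200):                      # training episodes
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)        # random behavior policy (off-policy)
        s2 = min(max(s + a, 0), N_STATES - 1)   # clamp to corridor bounds
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: bootstrap on the best action in the next state
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The greedy policy extracted from Q moves right in every non-goal state
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)}
```

Despite acting randomly during training, the learned Q-values make the greedy policy head straight for the goal, which is exactly the flexibility the text attributes to these methods compared with hand-coded rules.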
Action module: following the instructions of the decision-making module, it coordinates the movement of the robot's parts and interacts with humans and the environment in physical or simulated space to complete specific tasks, such as embodied question answering and embodied grasping. The action module involves a variety of control strategies, including explicit, implicit, and diffusion policies, and optimizes these strategies through reinforcement learning and imitation learning to achieve higher accuracy and adaptability.
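As a tiny illustration of imitation learning applied to an explicit policy, the sketch below fits a linear control law to demonstration data by least squares (behavior cloning in its simplest form). The demonstration pairs and the assumption that the teacher is a proportional controller are both made up for illustration; diffusion and implicit policies require far heavier machinery.

```python
# Behavior cloning of an explicit linear policy: fit action = k * error to
# demonstration (error, action) pairs by closed-form least squares.
# The demonstrations below imitate a hypothetical proportional controller.

demos = [(-1.0, 2.0), (-0.5, 1.0), (0.5, -1.0), (1.0, -2.0)]  # (error, action)

# Closed-form least-squares solution for the single gain k
k = sum(e * a for e, a in demos) / sum(e * e for e, _ in demos)

def policy(error):
    # Explicit policy: a direct mapping from observed error to action
    return k * error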
Feedback module: after the agent completes an action, it collects information about the execution result and environmental changes and passes this feedback to the decision-making and perception modules. The decision-making module adjusts subsequent decisions based on the feedback, and the perception module further refines its understanding of the environment, so that the agent continuously improves its behavior, better adapts to the environment, and completes its tasks, realizing growth in intelligence and adaptation of action.
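A minimal sketch of this feedback loop: the agent compares commanded motion with the measured outcome and keeps a running estimate of a systematic actuator error, which it then uses to correct future commands. The 0.8 "actuator gain" and the update rate are illustrative assumptions standing in for any unmodeled discrepancy between intent and outcome.

```python
# Feedback closing the loop: estimate an unknown actuator gain from the
# discrepancy between commanded and achieved motion, then compensate for it.
# The hidden gain 0.8 is an illustrative stand-in for an unmodeled error.

def execute(command):
    return 0.8 * command          # actuator behavior, unknown to the agent

gain_estimate = 1.0               # agent's initial (wrong) model of itself
for _ in range(50):
    commanded = 1.0 / gain_estimate       # compensate using current estimate
    achieved = execute(commanded)
    # Feedback update: nudge the estimate toward the observed ratio
    gain_estimate += 0.5 * (achieved / commanded - gain_estimate)
```

After a few dozen cycles the estimate converges to the true gain and the achieved motion matches the intended unit motion, mirroring how execution feedback lets the agent refine its model of itself and the environment.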