2024.08.28
Word count: 6,132; about a 10-minute read.
Summary: Humanoid robots have made significant progress, but they remain far from the ultimate expectations. Even so, imperfect as they are, humanoid robots have been accelerating toward commercial use this year.
Author | Zheng Xutong, China Business News (Yicai)
At this year's two major humanoid robot fairs, the industry's assessments of humanoid robots seemed to split into two poles. The "18 King Kong" lineup at the World Artificial Intelligence Conference drew crowds of onlookers, yet some exhibitors "complained" that certain robots still had to be hung from rigs to stand. At the World Robot Conference, which ended just last week, humanoid robots were hotter than ever, with 27 on display, the highest number in the event's history. Some practitioners told the China Business News reporter that humanoid robots "can move more" this year, while others said that "every demonstration video looks great, but in fact very few robots can walk up and perform (as in the video)."
Behind the two assessments lies one fact: humanoid robots have made significant progress, but they remain far from people's final expectations. In any case, even if imperfect, humanoid robots have been accelerating toward commercial use this year.
Recently, Zhiyuan Robot, founded by Peng Zhihui ("Zhihui Jun"), released five new commercial humanoid robots in one go and revealed that its factory has entered the final preparation stage for mass production; this year the company shipped about 200 biped humanoid robots. Jiao Jichao, vice president of UBTECH, the "first humanoid robot stock" listed in Hong Kong, and executive director of its research institute, told the reporter that the company has received intended orders for about 500 humanoid robots from the automotive industry.
A representative of Leju (Suzhou) Robot Technology Co., Ltd. told the reporter recently that "the company's humanoid robot partners include Haier, Huawei, NIO, and others." Stardust Intelligence CEO Lai Jie also told the reporter, "After our last video was released, a lot of orders came in."
Tesla CEO Elon Musk revealed some time ago that Tesla will start "limited production" of the Optimus humanoid robot next year, by which time it will have more than a thousand, or even several thousand, Optimus robots in operation.
Tesla's humanoid robot Optimus
Though still far from "easy to use," it is a fact that humanoid robots are accelerating toward "usable." Standing at the starting point of mass production, what can humanoid robots actually do? Can embodied intelligence be expected to emerge overnight, the way large models did? The China Business News reporter recently spoke with a number of humanoid robot practitioners, trying to reconstruct the road to mass production and to explore how AI can keep humanoid robots iterating.
"The robot is moving"
"Humanoid robots have moved from demo shows and static displays to real applications. Last year, most manufacturers just put a piece of hardware on display; it could not move, let alone be applied. This year, everyone attaches great importance to applications, and there are more and more dynamic models. The applications are also increasingly focused, basically concentrated on industrial scenarios," Jiao Jichao told the reporter. In addition, practitioners who attended the World Robot Conference mentioned that the industry used to pay more attention to humanoid robots' mobility, but this time the emphasis was clearly on manipulation ability.
A robot's ability to move and manipulate points toward application. Whether Tesla, UBTECH, or Leju, all chose industry as the first landing site, focusing in particular on the automotive sector. Lu Hanchen, director of the High-tech Robot Industry Research Institute (GGII), told China Business News that among manufacturing segments, automobiles have the largest industrial base, the highest degree of automation, and a relatively strong willingness to introduce robots. Industry insiders said that at robots' current stage of development, industry is the easiest field to enter first.
"There is strong demand for humanoid robots in factories such as automobile and 3C manufacturing; the problem is that the industry's hardware and software capabilities cannot yet fully meet all of manufacturing's needs. At present, many factories are willing to open up workstations that match robots' current capabilities, such as handling and quality inspection." Jiao Jichao said that humanoid robot applications can be divided into three stages: industrial, commercial services, and household scenarios. The scenarios become progressively more complex, with ever higher performance requirements and greater price sensitivity. Manufacturers choose to land in industry first because industrial scenarios let them polish underlying core technologies such as positioning and navigation, perception, and target recognition, while testing and optimizing hardware structure and system stability in high-load, high-frequency working environments, paving the way for entry into other scenarios.
Over the past year or so, it is not only large models that have changed the humanoid robot industry; the formation of an upstream supply chain and falling hardware costs have also played a role.
"Before large models came out, the production level of the hardware supply chain had already reached a certain stage, and some commercial landing scenarios were visible to everyone, so attention rose immediately." Wang Song told the reporter that the supply chain has changed markedly in the past year or so. For example, dedicated humanoid robot parts used to be unavailable; they could only be sourced from the supply chains of other industries, such as collaborative robot arms. The technical routes of those parts differed from humanoid robots' requirements, resulting in low integration, insufficient precision, and poor stability, so core parts could only be made in-house. Now, although it is still early to talk about hardware standardization, the supply chain has come up.
"We have had many contacts with suppliers in Shenzhen and found that they are not necessarily incapable of entering the humanoid robot field, but unwilling to invest before a complete market chain has formed. Once suppliers start to transform, the industry will develop." Lai Jie said that many upstream suppliers are already considering how their technology can be applied to robots and are making internal transformations; he expects the market to form a complete chain within two years.
Yang Fengyu of UniX AI told the reporter that the production capacity of robot products depends on product development, engineering capabilities, and supply chain advantages.
After receiving a doctorate in computer science from Yale University, Yang Fengyu, only 23 years old, started an embodied-intelligence robot venture last year. Yang believes that China's unique supply chain advantages give the robot industry strong supply chain resources; as long as domestic high-quality production capacity is integrated, delivering products in large quantities is no longer a problem.
Jiao Jichao told the reporter that after R&D iteration and the scaling-up of the upstream supply chain, the overall price of humanoid robots this year has dropped 40%-50% from last year. As humanoid robots' performance in industrial scenarios gradually stabilizes and volumes grow, whole-machine costs are expected to keep falling.
ChatGPT was born at the end of 2022, and in the year or so that followed, large models gave humanoid robots a "brain." Many humanoid robot manufacturers told the reporter that in the context of humanoid robots, the large model acts as the "brain": robot body makers focus on the body and the "cerebellum," while the "brain" comes from outside partners. The change large models bring lies in generalization, applied specifically to decision-making and planning of robot actions. Unlike fixed algorithmic programming, generalization can be understood as the ability to "draw inferences from one instance." With a "brain," it becomes possible for humanoid robots to enter factories and "work."
"Robot generalization has three levels. The first is biased toward perception: having recognized one object, can it recognize a second? The second is biased toward action: after performing one action, can it adapt and perform a second if the environment changes? The third is biased toward task: having completed one task, can it complete related tasks?" Lai Jie said that large models mainly bring generalization at the task level.
Wang Song described the generalization that large models bring to robots as mostly at the engineering level: for example, "it can grab a bottle of cola, and still grab it when it becomes Sprite." Sprite or cola can be replaced by various materials in industrial scenarios. Large models' generalization shows up in orchestrating task flows and understanding different items. Before large models emerged, the industry did not know how to achieve generalization; without it, completing varied tasks through dedicated programming meant an enormous workload. Large models offered robots a new approach to task planning, and the industry "saw a ray of hope." Now humanoid robots have a "brain," and the brain and cerebellum can cooperate: the brain handles perception and understanding, the cerebellum executes specific actions.
In a factory, a humanoid robot works like this: Wang Song told the reporter that the cerebellum exposes an interface to the brain. The cerebellum executes leg and hand movements and is responsible for actions such as "moving up or down one centimeter" and grasping; the brain is responsible for dispatching actions and for interrupting and recomposing them when something abnormal happens.
Jiao Jichao takes material sorting as an example: to recognize hundreds of thousands of materials, a humanoid robot needs a model with high performance and good generalization (or one that can be trained quickly), plus the perception ability of a multimodal large model. When the workflow hits an anomaly, such as a grasp that misses the material, the large model must know what to do next, which reflects its decision-making ability. In addition, the robot recognizes the material and computes its 6D pose, then hands it to the motion control module; an end-to-end small model (the cerebellum) decides where to grasp each material. This small model is trained with reinforcement learning and imitation learning.
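The division of labor described above can be sketched in a few lines of Python. Everything here is an illustrative assumption — the class names, the `Pose6D` fields, and the retry logic are invented for this sketch and do not reflect any manufacturer's real interfaces.

```python
from dataclasses import dataclass

@dataclass
class Pose6D:
    """6D pose of a material item: position (x, y, z) plus orientation (roll, pitch, yaw)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

class Cerebellum:
    """Stand-in for the small end-to-end model: decides where to grasp."""
    def grasp_point(self, pose: Pose6D, material: str) -> tuple:
        # A real policy would be learned via reinforcement/imitation learning;
        # here we simply offset toward the item's top surface.
        return (pose.x, pose.y, pose.z + 0.02)

class Brain:
    """Stand-in for the multimodal large model: perception, planning, anomaly recovery."""
    def __init__(self, cerebellum: Cerebellum):
        self.cerebellum = cerebellum

    def sort(self, pose: Pose6D, material: str, max_retries: int = 2) -> str:
        for _ in range(1 + max_retries):
            point = self.cerebellum.grasp_point(pose, material)
            if self._execute_grasp(point):     # cerebellum acts, sensors report back
                return f"sorted:{material}"
            # anomaly: the grasp missed -> the brain decides to retry
        return f"escalate:{material}"          # brain hands off after repeated failure

    def _execute_grasp(self, point) -> bool:
        return True  # stub: real execution would report success/failure from sensors

brain = Brain(Cerebellum())
print(brain.sort(Pose6D(0.4, 0.1, 0.8, 0.0, 0.0, 0.0), "bolt"))  # sorted:bolt
```

The point of the split is visible even in the stub: the brain never computes joint motions, and the cerebellum never decides what to do after a failure.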
Simple actions that humans acquired over a long evolutionary history are complicated for humanoid robots. With large models, humanoid robots have begun to learn to think with a "brain" and to take up work.
"What large models cannot do"
Although large models have given humanoid robots a ray of "dawn," the AI capabilities of humanoid robots by no means come only from large models. As a culmination of AI technologies, humanoid robots are both pulled forward and held back by a range of techniques. Taking these technical advances apart makes it easier to understand the capabilities and limitations of today's humanoid robots.
Several important developments related to humanoid robots are tucked into this year's news. At the beginning of the year, Stanford University's cooking robot Mobile ALOHA made its debut. The robot learns human hand operations through neural networks; after learning from dozens of demonstrations, it can independently complete tasks such as cooking shrimp, wiping tables, and washing dishes. The industry regards it as a breakthrough in imitation learning. During the year, robot makers also demonstrated biped robots walking out of the experimental environment and into natural settings. Taking one domestic company's powered biped robot as an example, its founder Zhang Wei explained that behind the robot lies a breakthrough in reinforcement learning; this technical "switch" was discovered only in the past year or so.
Imitation learning can be thought of as a machine learning by imitating human behavior; its advantage is that, unlike a large model, it does not require training on a very large amount of data to learn and complete certain tasks. Reinforcement learning, on the other hand, can be understood as setting a goal for the robot, which then learns to make correct decisions through rewards and penalties in a process of constant trial and error.
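The two paradigms can be contrasted on a deliberately tiny toy task, which is ours and not from any robotics system: an agent on a five-cell line must reach cell 4. Imitation learning copies a handful of human demonstrations into a lookup policy; reinforcement learning rediscovers the same policy purely from reward and penalty over repeated trials.

```python
import random

random.seed(0)  # make the trial-and-error loop reproducible

GOAL, N_STATES, ACTIONS = 4, 5, (-1, +1)  # step left / step right on a line

# --- Imitation learning: copy a few demonstrations directly into a policy ---
demos = [(s, +1) for s in range(GOAL)]          # a "human" always steps right
policy_il = {s: a for s, a in demos}            # lookup table learned from demos

# --- Reinforcement learning: learn the same behavior from reward alone ---
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(500):                            # episodes of trial and error
    s = 0
    while s != GOAL:
        if random.random() < 0.2:               # occasionally explore
            a = random.choice(ACTIONS)
        else:                                   # otherwise exploit current estimate
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)   # environment transition (clamped)
        r = 1.0 if s2 == GOAL else -0.01        # reward at the goal, small cost otherwise
        Q[(s, a)] += 0.5 * (r + 0.9 * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

policy_rl = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)}
print(policy_il == policy_rl)  # True: trial and error rediscovers the demonstrated policy
```

The contrast matches the definitions above: the imitation policy needed only four demonstration pairs, while the reinforcement learner needed hundreds of trial episodes but no demonstrations at all.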
Jiao Jichao believes that the major technical breakthroughs related to humanoid robots over the past year are end-to-end operations based on imitation learning or reinforcement learning. Motion control and gait based on reinforcement learning make robots more usable in real scenarios, while imitation learning performs well for dexterous arm operation in specific environments and can be deployed quickly, reducing some of the difficulty of complex tasks. "But for both reinforcement learning and imitation learning, generalization is a big challenge. Imitation learning also relies heavily on manual teleoperation to collect data, which demands high data quality and is difficult to produce in a simulated environment."
Yang Fengyu told the reporter that humanoid robots are systems engineering, involving both hardware and software. At present there is clearly some mismatch between the pace of hardware and software development: the large model can think, but it cannot command the body, cannot drive the hardware. Humanoid robots' bodies are not yet capable enough, and the tasks they can complete are few, which makes it relatively difficult to develop embodied intelligence on that basis. Of course, body and brain constrain each other: if the body develops well but the brain does not measure up, application scenarios will also be limited.
"Start from iterating the body, then add basic applications; once the body converges to a relatively stable form, applications will begin to flourish. The current large-model technical route still relies on massive data. ChatGPT 3.5 and GPT-4 basically read all human data, and data is undoubtedly just as critical in embodied intelligence." Yang Fengyu said the technology needs constant iteration: first hardware, then data, then building models to form a closed loop.
"The main AI advances in the past year include breakthroughs in deep reinforcement learning and imitation learning, in addition to the improvement in robot decision-making brought by large models." Wang Song told the reporter that reinforcement learning solves humanoid robots' locomotion problems and improves their adaptability to complex environments. "Imitation learning is more like the technical architecture of large models: it provides a set of end-to-end robot control ideas, behind which is also a model for specific scenario tasks, though the parameter count is not very large." Wang Song said imitation learning may still move toward universal generalization, and the parameter count will certainly be very large by then. Next, imitation learning will focus on solving its poor generalization. For example, the small model of the Stanford cooking robot can only complete one task at a time, and new ideas have emerged, such as Google's related models that complete multiple tasks with one model.
Beyond decision-making and planning, a number of manufacturers also demonstrated humanoid robots' interactive abilities combined with large models during the year. For example, Figure AI's humanoid robot, connected to an OpenAI model, can reach out, pick up an apple on the table, and explain why it did so. UBTECH showed a humanoid robot connected to Baidu's Wenxin (ERNIE) large model that can likewise converse with humans.
However, interactive ability is not necessary in industrial and similar scenarios. In fact, the application of large models on humanoid robots is not yet extensive, and it has many limitations.
For example, the small model responsible for executing a humanoid robot's actions can be distilled (made lightweight) from a large model, but this is not necessary. Wang Song said the efficiency and execution accuracy of a distilled small model are inferior to traditional motion control; simple forward and inverse kinematics algorithms are already very accurate, and solving with a model is more like taking a detour.
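Wang Song's point about classic kinematics can be seen in the textbook two-link planar arm below. The link lengths are arbitrary example values; the formulas are the standard closed-form solution, exact to machine precision, with no learned model involved.

```python
import math

L1, L2 = 0.3, 0.25  # example link lengths in metres (illustrative values)

def forward(theta1, theta2):
    """Forward kinematics: joint angles -> end-effector position (x, y)."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y

def inverse(x, y):
    """Analytic inverse kinematics: target (x, y) -> one joint solution (elbow-down)."""
    c2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    theta2 = math.acos(max(-1.0, min(1.0, c2)))  # clamp against rounding error
    k1 = L1 + L2 * math.cos(theta2)
    k2 = L2 * math.sin(theta2)
    theta1 = math.atan2(y, x) - math.atan2(k2, k1)
    return theta1, theta2

# Round trip: inverse kinematics followed by forward kinematics recovers the target.
x, y = forward(0.6, 0.9)
t1, t2 = inverse(x, y)
x2, y2 = forward(t1, t2)
print(abs(x - x2) < 1e-9 and abs(y - y2) < 1e-9)  # True
```

A distilled neural network would approximate this mapping with some error and extra inference cost, which is why replacing it is "taking a detour."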
In addition, with the large model serving as the "brain," simply enlarging its parameter count to improve capability is unrealistic: large-model inference needs computing power, and sufficient electrical power behind it.
The large model carried by UBTECH's humanoid robot originally had 7 billion parameters; the current version has about 1 billion. Jiao Jichao said that large-model inference places high demands on CPU and GPU hardware, while the onboard computing power of humanoid robots is still far below that of desktop servers; without making the model lightweight, it is hard to run on the device side. "(Behind the device-side compute limit) are two factors: compute chips and batteries. At present there are few miniaturized compute boards, and the structural space of a biped humanoid robot is limited, so it cannot carry too large a battery (to power the compute)," Wang Song said.
Looking further ahead, the industry's expectation of large models goes far beyond serving as a "brain" for task-planning decisions; the hope is that large models can more "smoothly" integrate with the robot's whole body, which can be roughly understood as controlling the robot with one complete neural network, embodying human-like intelligence in the body. Many industry insiders expressed similar views to the reporter: they hope large models can eventually absorb the small models to achieve true end-to-end control. For example, a robot would "naturally" know what to do after understanding its surroundings, without mechanically layering its operation into modules such as perception, planning, and control, which impose excessive constraints. An end-to-end neural network works in a way similar to the human brain, and the approach has already been validated in autonomous driving.
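The two architectures being contrasted can be sketched schematically. This is a purely illustrative stub: the module functions and the `net` stand-in are invented for the sketch, and a real end-to-end policy would be a trained network, not a lambda.

```python
# Today's modular stack: hand-engineered layers passing structured data along.
def perceive(obs):
    return {"objects": obs}                    # perception module

def plan(world):
    return ["approach", "grasp"]               # planning module

def control(step):
    return f"motor:{step}"                     # control module

def modular_policy(obs):
    """Perception -> planning -> control, each layer explicitly programmed."""
    return [control(step) for step in plan(perceive(obs))]

# The hoped-for alternative: one network maps raw observations to motor commands.
def end_to_end_policy(obs, net=lambda o: ["motor:approach", "motor:grasp"]):
    """`net` is a stand-in for a single learned model covering the whole pipeline."""
    return net(obs)

print(modular_policy(["apple"]) == end_to_end_policy(["apple"]))  # True
```

Both produce the same motor commands here; the difference the industry cares about is that the modular version hard-codes its intermediate representations, while an end-to-end network would learn them.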
But end-to-end is not easy to achieve on humanoid robots.
Real data is missing
Jiao Jichao said that at present the industry cannot complete a task in a purely end-to-end way, with, for example, the recognition, perception, and grasping in a grasping task all output by one model; the hope is that end-to-end capability will eventually let humanoid robots handle unexpected situations and complete tasks autonomously.
"Autonomous driving is now end-to-end, (reaching) L4-level driverless capability. Ten years ago, intelligent driving was also split into four parts: perception, prediction, planning, and control. Later these were gradually merged, and only once enough data had accumulated did end-to-end training deliver a big jump in capability." Lai Jie said robots should follow the same path: once enough data accumulates, questions such as "whether to merge" will answer themselves. Some humanoid robot practitioners told the reporter that end-to-end is currently impossible, one main reason being that the amount of training data is insufficient.
Similar to the data bottleneck of large language models, lack of data has become a major constraint on humanoid robots' intelligence. The difference is that the large language model's bottleneck comes from Internet text approaching its limit, while the humanoid robot's bottleneck is that real-world data is hard to obtain.
Jiao Jichao said that without VLA (vision-language-action) data, training struggles to converge when model parameters are large. Vision and language data are abundant, but action, operation, and control data are scarce, and such data cannot simply be generated in simulation: it must be collected on hardware, in real environments, and using simulated data brings overfitting problems.
"Tesla's autonomous driving also began by collecting large amounts of real data, gradually building a world model, and then collecting more real data (from users' actual driving). The premise is to have enough real data." Jiao Jichao said UBTECH collects data by building real scenes, cooperates with users on data collection, and uses some simulation data; the amount of real data needs to be much higher than the simulated amount. Simulated data could exceed real data only with a model good enough to describe the physical world, one that behaves exactly like it.
"We use simulation data, human motion-capture data, and data from the robot's actual operation." Lai Jie said hardware is the source of data, which is why humanoid robot hardware and AI must develop in step. The most valuable data comes from the robot body itself; building a data factory, or an industry-wide data set, is worth trying.
"In the end it must be completed with large-scale real-machine data. Only with real use and real data can the technology keep evolving," Yang Fengyu said.
Zhang Zhengyou of Tencent, director of the Tencent Robotics X laboratory, also pointed to the scarcity of embodied-intelligence data at the forum "Prospects of Human-Machine Relations in the AI Era" at the end of July. He said OpenAI initially hoped to reach AGI (artificial general intelligence) directly through robots, but the lack of data held it back; the data problem still needs to be solved.
It is instructive that the strong software-hardware coupling humanoid robots already show at the data level may continue through their subsequent development. Jiao Jichao told the reporter that a robot's autonomous ability must still be tied to hardware: if hardware performance is not there, the software will only ever stay in simulation. Wang Song said humanoid robots' software and hardware are strongly coupled and need to iterate on each other.
"In the era of large models, some people think large models are so powerful that (AGI) can be implemented on robots immediately, but that is not the case." Zhang Zhengyou said that, by analogy, the situation today is like a 20-year-old's brain on a 3-year-old's body: the robot can move to some extent, but its manipulation ability is very weak. Real embodied intelligence should learn and handle problems independently, adjusting and re-planning in response to environmental change and uncertainty; this is a crucial step on the way to AGI, or to building general-purpose intelligent robots. Zhang Zhengyou said that "stuffing" a large model into a robot's head achieves only partial intelligence; real intelligence can emerge only from the robot's interaction with its environment, after intelligence and body are organically integrated.