Nie Liqiang from Harbin Institute of Technology: Multimodal large models are the key driving force for the development of embodied intelligence | Ten People Talk about Embodied Intelligence

Release time: 2024-07-14

Without the "brain" of a multimodal large model, the "body" is just a mechanical device without intelligence.

The following article is from AI Technology Review, written by Chen Luyi.

How is the “intelligence” of embodied intelligence manifested?

This has been one of the most frequently raised questions in our visits to researchers in the field since Leifeng.com's AI Technology Review launched the "Ten People Talk about Embodied Intelligence" column.

In short, embodied intelligence refers to a technology that combines intelligent systems with physical entities to enable them to perceive the environment, make decisions, and perform actions. The key word is "embodied", which means that it is not just abstract algorithms and data, but interacts with the world through physical form.

However, to achieve true "intelligence", embodied intelligent systems need a powerful "brain" to support their complex decision-making and learning processes. The "brain" here is not an organ in the biological sense, but an advanced computing model that can process and understand multimodal information: a multimodal large model. This model can integrate multiple kinds of sensory data such as vision, hearing, and touch, as well as abstract information such as language and instructions, to provide robots with a richer and more comprehensive ability to understand the environment.

In November 2022, the advent of ChatGPT demonstrated a breakthrough in large language models (LLMs). It not only sparked wide-ranging ideas about applying large models across industries, but also pushed "embodied intelligence" into the spotlight, triggering in-depth discussions on how machines can interact more naturally with humans and the environment, and inspiring a new wave of multimodal large model research.

Natural language processing (NLP) is one of the underlying core technologies of large models. Harbin Institute of Technology is a well-established engineering school with strong NLP research and rich technical accumulation in large model research. Jiutian, Harbin Institute of Technology's self-developed, autonomous and controllable multimodal large model, has attracted widespread attention in the industry. Jiutian has the remarkable characteristics of wide modal coverage, top multimodal data sets, strong modal connection capabilities, and strong scalability. It performs well in many evaluation indicators. Jiutian's papers on video-text processing and image-text processing won the Best Paper Award of ACM MM 2022.

The multimodal large model and embodied intelligence research at HIT is led by Professor Nie Liqiang, who has focused on multimodal content analysis and understanding for the past 15 years and is convinced of the importance of multimodal perception, fusion and understanding. He realized that traditional robots have weak autonomous decision-making capabilities, while multimodal large models are good at understanding and decision-making but cannot interact with the physical world. This inspired him to combine the two, using the robot as the torso and the multimodal large model as the brain to achieve complementary advantages.

Some people believe that multimodal large model technology will drive rapid upgrades of the robot's "brain", and that the brain will evolve far faster than the robot body itself. It may pass the point of technological maturity and enter the stage of large-scale industrial deployment in the next 2 to 3 years.


Recently, AI Technology Review visited Professor Nie Liqiang and discussed with him topics such as research trends in the field of embodied intelligence and the challenges faced by the integration of industry, academia and research. The following is a transcript of the interview between AI Technology Review and Nie Liqiang on the topic of embodied intelligence. Due to space limitations, AI Technology Review has edited the text without changing the original meaning:


1

"Brain" drives the development of embodied intelligence

AI Technology Review: What do you think of the recent embodied intelligence craze? When people are researching and discussing embodied intelligence, what are their expectations for technology and applications?

Nie Liqiang: The embodied intelligence craze is the result of the combination of large model technology in artificial intelligence and robotics technology. The breakthrough of large model technology provides robots with a new "brain", and the interaction between robots and the physical world also brings new focus to large models. The two promote each other and complement each other's advantages.

Research trends in the field of embodied intelligence are also constantly changing. In the initial stage of large model empowerment, some work directly applied new achievements in the field of artificial intelligence to robots, but it was not in-depth enough. For example, the common modalities of multimodal large models are vision and text, but robots are exposed to a wider range of information: vision, hearing, touch, human instructions, the position of the robot arm, and so on. In the future, large models need to adapt to the characteristics of embodied intelligence tasks, perceiving and interacting in the real physical world and integrating rich multimodal information.

Recently, research on embodied intelligence driven by large models has gradually deepened, moving from preliminary applications to deep integration, especially integration with robot motion control, which is the key to technological development and also a major challenge. As research deepens, we expect large models to more comprehensively understand and control the robot's body and achieve deeper physical interaction.

If the challenges in the field of embodied intelligence are effectively solved, its application potential is huge. Embodied intelligence applications can bring agents into vertical fields such as intelligent manufacturing and the service industry, for example industrial inspection and housekeeping services, so that embodied intelligence can lead the upgrade of manufacturing, services and other industries. As the technology matures, its application scenarios will become more extensive.

AI Technology Review: What role do multimodal large models play in embodied intelligence?

Nie Liqiang: The multimodal large model is the "brain" of the embodied intelligent robot and is of vital importance. It sits upstream in development and provides intelligence for the robot. Without this "brain", the downstream robot "body" is just a mechanical device without intelligence. A powerful multimodal large model is the key driving force for the development of the embodied intelligence field.

The multimodal large model overcomes the limitation that a single modality is not enough to cope with complex real-world scenarios. It greatly improves the robot's perception and understanding capabilities, enabling the robot to understand complex scenarios and tasks more accurately and comprehensively. In addition, the multimodal large model has learned a wealth of human knowledge through large-scale pre-training, giving the robot the ability to make autonomous plans and decisions.

The multimodal large model also optimizes human-machine interaction. It allows robots to accurately understand human intentions through multimodal information such as voice and gestures, making the interaction between us and robots more natural. The powerful generalization ability of the multimodal large model also lays the foundation for the robot's autonomous learning ability, helping the robot adapt to changing tasks, and taking a big step towards becoming a truly intelligent entity with the ability to autonomously learn and adapt to environmental changes.

I believe that the multimodal large model, as the "brain", affects all aspects of the robot. Its upstream empowerment of the robot has removed the key obstacles to the implementation of embodied intelligence and is the source of progress in the field of embodied intelligence.


2

Future Trends: Humanization and Collaboration

AI Technology Review: What trends do you expect in the future development of multimodal large models in the field of embodied intelligence?

Nie Liqiang: The future development of multimodal large models in the field of embodied intelligence will bring revolutionary changes, making AI systems more human-like in their interaction with and understanding of the physical world. It is foreseeable that the following key trends will shape this field in the coming years:

Multimodal perception: The model will seamlessly integrate multiple kinds of sensory information such as touch and smell to provide a more comprehensive understanding of the environment, close to human perception capabilities.

Model lightweighting: Develop efficient multimodal large model architectures and use model compression and knowledge distillation techniques to improve the flexibility and efficiency of embodied systems.

Transfer and few-shot learning: Embodied AI will demonstrate advances in transfer learning and few-shot learning, quickly adapting to new tasks without requiring large amounts of data for training.

Development of underlying technology: Models will better connect abstract knowledge with physical reality, promote breakthroughs in common sense reasoning and causal understanding, and enhance long-term memory and continuous learning capabilities.

Natural interaction capabilities: Improve the intuitiveness and contextual awareness of communication between people and AI machines, enabling robots to conduct complex conversations and interpret environments and actions.

World model building: Creating a comprehensive internal representation of the world for planning, prediction, and decision-making by embodied AI.

Neuromorphic computing fusion: Multimodal large models are combined with neuromorphic computing methods to simulate biological neural networks and improve energy efficiency and adaptability.

These trends suggest that in the future, embodied AI systems will become closer to humans in terms of understanding and interacting with the world through multimodal large models, opening up possibilities for a wide range of applications and fields.
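As an illustration of the knowledge distillation mentioned under model lightweighting, the following is a minimal sketch of the widely used soft-target distillation loss: a small "student" model is trained to match the temperature-softened output distribution of a large "teacher". The interview does not describe how Jiutian or any other model is actually compressed, so the function names, the temperature value, and the toy logits here are assumptions for illustration only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,  # assumed value, purely illustrative
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is only the soft-target term; real training usually adds a
    standard supervised loss on ground-truth labels.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy usage: a student that matches the teacher has zero loss,
# while a student that ranks the classes in reverse has a positive loss.
teacher = [1.0, 2.0, 3.0]
matching_loss = distillation_loss([1.0, 2.0, 3.0], teacher)
reversed_loss = distillation_loss([3.0, 2.0, 1.0], teacher)
```

Minimizing this loss pushes the smaller model towards the teacher's behavior, which is one common route to models light enough to run on a robot's onboard hardware.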

AI Technology Review: What do you think is the biggest challenge currently facing large multimodal models?

Nie Liqiang: The biggest challenge of multimodal large models is how to integrate and align multiple data modalities while maintaining coherence, efficiency, and ethical considerations. Different modalities such as text, images, audio, and video have unique characteristics, and aligning them is a fundamental problem that requires effective shared representation through pre-training, fine-tuning, and architecture design.

The computational resources required for large multimodal models grow exponentially with model size and number of modalities, raising issues of scalability, accessibility, and deployability that may limit the adoption of the models.

Data quality and diversity are also a significant hurdle. Acquiring large-scale, high-quality, and unbiased multimodal datasets is a time-consuming and expensive process.

The complexity of models also makes it increasingly difficult to ensure interpretability and understandability, which are essential to the trustworthiness of models in critical applications.

Finally, multimodal large models also face challenges in terms of ethics and social impact. Issues such as misinformation, deep fakes, and privacy violations require the formulation of corresponding safeguards and ethical guidelines, and more importantly, the attention and cooperation of all parties.


3

Collaboration between academia and industry

AI Technology Review: What do you think of the current collaboration between academia and industry in embodied intelligence research?

Nie Liqiang: Embodied intelligence research requires combining the basic research and innovative thinking of academia with the practical experience and data of industry to jointly overcome complex scientific and technological challenges. Many embodied intelligence companies founded in the past 1-2 years were incubated by universities. The increase in university-incubated companies shows the key role of academia in promoting technology commercialization.

Government support has provided impetus for university-enterprise cooperation: by encouraging universities and enterprises to apply for projects jointly, it has provided the necessary funding and platform support. The establishment of joint laboratories has promoted the deep integration of academia and industry, and accelerated the exchange and innovation of knowledge.

To strengthen cooperation, we need to further align academic research with industry needs, develop standardized embodied intelligence research platforms and protocols, and cultivate talent who can connect the two worlds. As educators, we have the responsibility to cultivate students' cross-disciplinary capabilities in knowledge, technology, and research methods.

Overall, the cooperation between academia and industry shows great potential in the field of embodied intelligence. Through government support, joint laboratories, and alignment of research with demand, universities and enterprises will jointly promote the innovative development of embodied intelligence.

AI Technology Review: What is the prospect of embodied intelligence in academia and industry? What specific research cases do you and your team have?

Nie Liqiang: Embodied intelligence is highly favored in both academia and industry, and has opened up a new path for cutting-edge cross-disciplinary research. Both AI researchers and robotics researchers are actively exploring this field. Industry is optimistic about the application prospects of large-model-enabled robots and about overcoming the associated challenges.

(Figure: Ruoyu Jiutian project, technical verification in an unmanned kitchen scene)

HIT has made significant research progress in the field of embodied intelligence, such as the Ruoyu Jiutian project, which has achieved technical verification in unmanned kitchen scenarios and made breakthroughs in key technologies such as multimodal large models driving group intelligence. We have successfully combined multimodal large models with robot entities and developed a robot system with perception, interaction, planning and action capabilities.

In this process, we faced challenges such as multimodal information fusion, complex task planning, and precise motion control. Each step requires careful research. For example, large models must effectively process multimodal information, the robot's "brain" needs to accurately plan tasks, and the "cerebellum" is responsible for precise action execution. These research results provide a solid foundation for the application of embodied intelligence.

AI Technology Review: What are Harbin Institute of Technology's future development plans in the field of embodied intelligence?

Nie Liqiang: Building on HIT's current research foundation in multimodal large models and robotics, we have formulated a systematic research plan for embodied intelligence. It includes multiple aspects such as perception, planning, manipulation, and group collaboration of agents, and covers various forms of agents such as robotic arms, drones, and humanoid robots.

In short, embodied intelligence is a promising research field. HIT will continue to promote scientific and technological innovation and talent cultivation, and strive to make greater contributions to academia and industry.

4

Practice of the Brain + Cerebellum Paradigm

AI Technology Review: Ruoyu Technology once proposed the slogan of "equipping robots with brains". How do you view the synergistic relationship between the brain and cerebellum, and future research directions?

Nie Liqiang: Ruoyu Technology is a high-tech company incubated at Harbin Institute of Technology. It emphasizes the collaborative work of the robot's cognitive system (brain) and motion control system (cerebellum). The multimodal large model Jiutian is responsible for understanding, perception, planning and decision-making tasks, while the cerebellum performs precise physical movements and interactions. This collaboration ensures that the robot can carry out specific control according to high-level instructions and feed execution results back to the brain so that it can adjust the strategy, which is crucial for adaptability and robustness.
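The brain-cerebellum loop described above can be sketched roughly as follows. This is a minimal illustration of the general pattern (plan, execute, feed the result back, replan), not Ruoyu's or Jiutian's actual implementation. All names here (`run_task`, `toy_plan`, `ToyController`, the step strings) are hypothetical, and the planner is a hard-coded stand-in for what would in practice be a multimodal large model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    step: str
    success: bool
    detail: str = ""


def run_task(
    instruction: str,
    plan: Callable[[str, List[StepResult]], List[str]],
    execute: Callable[[str], StepResult],
    max_replans: int = 3,
) -> Tuple[bool, List[StepResult]]:
    """Closed loop: the "brain" plans, the "cerebellum" executes, failures trigger replanning."""
    history: List[StepResult] = []
    for _ in range(max_replans + 1):
        steps = plan(instruction, history)   # high-level planning ("brain")
        failed = False
        for step in steps:
            result = execute(step)           # low-level control ("cerebellum")
            history.append(result)           # execution feedback for the brain
            if not result.success:
                failed = True
                break                        # stop and let the brain adjust
        if not failed:
            return True, history
    return False, history


def toy_plan(instruction: str, history: List[StepResult]) -> List[str]:
    # Hypothetical planner: if the last step failed, add a "reposition" step first.
    if history and not history[-1].success:
        return ["reposition", "grasp cup", "place cup"]
    return ["grasp cup", "place cup"]


class ToyController:
    """Hypothetical controller: grasping fails until the robot has repositioned."""

    def __init__(self) -> None:
        self.positioned = False

    def execute(self, step: str) -> StepResult:
        if step == "reposition":
            self.positioned = True
            return StepResult(step, True)
        if step == "grasp cup" and not self.positioned:
            return StepResult(step, False, "object out of reach")
        return StepResult(step, True)


controller = ToyController()
ok, history = run_task("put the cup on the shelf", toy_plan, controller.execute)
```

In this toy run the first grasp fails, the failure is passed back to the planner, and the revised plan succeeds. Keeping the planner and controller behind plain function interfaces reflects the separation of roles the interview describes, with feedback as the only coupling between them.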

Ruoyu's future research will focus on strengthening this synergy by integrating model planning with low-level control algorithms. This includes developing error correction and online learning mechanisms so that the brain can quickly adjust according to the execution results of the cerebellum, optimizing the planning of long-sequence tasks, and improving the robot's perception and decision-making capabilities through multimodal perception and adaptive learning. In addition, Ruoyu will also explore how to use the brain's high-level understanding ability to improve the performance of the cerebellum, such as guiding grasp planning or trajectory optimization through semantic understanding.

AI Technology Review: What innovations and breakthroughs has Ruoyu Technology made in multimodal large models and embodied intelligence? How is multimodal large model technology applied to products?

Nie Liqiang: Ruoyu Technology has made breakthroughs in the development of embodied intelligence driven by multimodal large models. It has innovatively implemented the brain-cerebellum paradigm, integrated natural language processing, visual perception, and action planning, and enabled robots to have intelligent "brains" in multiple fields.

Core technologies include retrieval-augmented large-model planning that reduces hallucination, which allows robots to autonomously perform complex tasks based on natural language instructions, such as order processing and serving coordination in unmanned kitchens. In terms of 3D perception, robots can understand and manipulate objects in complex environments without pre-registration, showing high flexibility and robustness.

Ruoyu Technology has also achieved imitation learning driven by diffusion models, enabling robots to learn complex skills without programming. These technologies are integrated into our Jiutian robot "brain", which supports multimodal interaction. Through standardized cloud + device delivery via API + SDK, and in cooperation with industry chain partners, they are applied to food processing, sorting, assembly, the 3C industry, and other areas.

Ruoyu has deployed the "Jiutian" robot in specialized fields, using imitation learning to perform commercial tasks efficiently. In the future, Ruoyu will promote the productization of multi-agent planning according to scenario requirements and realize a closed business loop based on the collaboration of multiple robots.

AI Technology Review: How do you evaluate the current effectiveness of embodied intelligence technology in real-world scenarios?

Nie Liqiang: Embodied intelligence technology has demonstrated significant benefits in many fields. In the manufacturing industry, it has improved the interactive capabilities of robots, enhanced production efficiency and flexibility, and reduced human errors. In the logistics and warehousing fields, embodied intelligent robots have optimized the classification and handling processes of items through autonomous navigation and deep learning algorithms, improving logistics speed and reducing costs.

The service industry has also witnessed the benefits of embodied intelligence, such as greeting, ordering and delivery robots in the hotel and catering industry, which improve customer experience and save labor costs. Despite the challenges of technical costs, environmental adaptability and ethics, the results of embodied intelligence technology in real-world scenarios are positive and show broad prospects, but the technology still needs to be continuously improved and optimized to adapt to ever-changing market needs.

