Advancements in Vision-Language Models for Embodied AI

Embodied AI research is increasingly centered on vision-language models (VLMs) that integrate visual perception, natural language understanding, and decision-making. Recent work aims to improve the performance and adaptability of VLMs across applications such as robotic manipulation, autonomous driving, and human-robot interaction. One notable direction is self-evolving VLM frameworks that let agents continue learning and adapting at test time, which has improved navigation success rates and decision quality. Another is the integration of VLMs with additional modalities, such as tactile sensing and audio, toward more comprehensive, human-like intelligence. Large language models and multimodal learning have also shown promising results on tasks such as visual homing, object manipulation, and scene understanding. Noteworthy papers include 'SelfReVision', which introduces a lightweight and scalable self-improvement framework for vision-language procedural planning, and 'LLaPa', which presents a vision-language model framework for counterfactual-aware procedural planning. Together, these advances point toward more efficient, adaptable, and human-like decision-making in embodied AI.
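The self-improvement idea behind frameworks such as SelfReVision can be pictured as a draft-critique-revise cycle over a candidate plan. The Python sketch below is purely illustrative, not the actual SelfReVision algorithm: the query_vlm callable, prompts, and stopping rule are assumptions standing in for whatever image-plus-text model call a given framework uses.

# Hypothetical sketch of a critique-and-revise loop for vision-language
# procedural planning. `query_vlm` is an assumed stand-in for any
# (prompt, image) -> text model call, not an API from a specific library.
from typing import Callable

def self_improve_plan(
    query_vlm: Callable[[str, bytes], str],  # (prompt, image) -> text
    image: bytes,
    goal: str,
    max_rounds: int = 3,
) -> str:
    """Iteratively draft, critique, and revise a step-by-step plan."""
    # Initial draft conditioned on the scene image and the goal.
    plan = query_vlm(f"Write a step-by-step plan to: {goal}", image)
    for _ in range(max_rounds):
        # Ask the model to critique its own plan.
        critique = query_vlm(
            f"Goal: {goal}\nPlan:\n{plan}\n"
            "List missing, unsafe, or infeasible steps. Reply 'OK' if none.",
            image,
        )
        if critique.strip().upper() == "OK":
            break  # the model judges the plan complete
        # Revise the plan in light of the critique.
        plan = query_vlm(
            f"Goal: {goal}\nPlan:\n{plan}\nCritique:\n{critique}\n"
            "Rewrite the plan to address the critique.",
            image,
        )
    return plan

In this pattern the improvement happens entirely at inference time, which is what allows such frameworks to remain lightweight: no gradient updates are made, only repeated calls to the same model in different roles.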
Sources
NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization
Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation
osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning