Robotics is advancing rapidly through the integration of vision-language models, which enable robots to better understand and interact with their environments. Recent work has focused on learning from multimodal feedback, such as language and vision, and on generating actions conditioned on that feedback, yielding significant progress in robotic manipulation, navigation, and scene understanding. Foundation models and large language models have been central to these advances.
Some noteworthy papers in this area include:
- PRIMT, which introduces a preference-based reinforcement learning framework that leverages foundation models for multimodal synthetic feedback and trajectory synthesis.
- OmniVLA, which presents a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation.
- VLN-Zero, which proposes a two-phase vision-language navigation framework that uses vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation.
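To make the preference-learning idea behind frameworks like PRIMT concrete, the sketch below trains a small reward model from synthetic pairwise trajectory preferences with a standard Bradley-Terry objective. It is a minimal illustration, not the paper's actual method or API: the `RewardModel` class, the placeholder `synthetic_preference` judge, and the random trajectory features are all assumptions; in a real system the preference labels would come from a foundation model scoring multimodal rollouts.

```python
# Minimal sketch: learning a reward model from synthetic pairwise preferences.
# All names and shapes here are hypothetical, for illustration only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps per-step trajectory features to a scalar trajectory reward."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, traj_feats: torch.Tensor) -> torch.Tensor:
        # Sum per-step rewards over the trajectory: (B, T, F) -> (B,)
        return self.net(traj_feats).squeeze(-1).sum(dim=-1)

def synthetic_preference(traj_a: torch.Tensor, traj_b: torch.Tensor) -> torch.Tensor:
    """Stand-in for a foundation-model judge: prefers the trajectory with the
    larger mean feature value (a placeholder heuristic, not the real signal)."""
    return (traj_a.mean(dim=(1, 2)) > traj_b.mean(dim=(1, 2))).float()

def bradley_terry_loss(r_a: torch.Tensor, r_b: torch.Tensor, pref_a: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: models P(a preferred over b) = sigmoid(r_a - r_b)."""
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, pref_a)

if __name__ == "__main__":
    feat_dim, horizon, batch = 8, 20, 32
    model = RewardModel(feat_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(200):
        # Random trajectory pairs stand in for policy rollouts.
        traj_a = torch.randn(batch, horizon, feat_dim)
        traj_b = torch.randn(batch, horizon, feat_dim)
        pref_a = synthetic_preference(traj_a, traj_b)

        loss = bradley_terry_loss(model(traj_a), model(traj_b), pref_a)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The learned reward model would then supply the reward signal for a downstream reinforcement-learning loop; the key design choice in this family of methods is replacing costly human preference labels with synthetic feedback from foundation models.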