Vision-Language-Action Models for Embodied AI

The field of embodied AI is moving toward more transparent and steerable models, with a focus on vision-language-action (VLA) models that can adapt quickly to new tasks, modalities, and environments. Recent work introduces frameworks for interpreting and steering VLA models via their internal representations, enabling direct intervention in model behavior at inference time. In parallel, vision-language world models are being used for planning with reasoning, understanding and reasoning about actions at semantic and temporal levels of abstraction. There is also growing interest in applying large language models (LLMs) to autonomous driving, with studies evaluating how well LLM modules transfer to motion generation and anomaly detection tasks.

Noteworthy papers include the work on mechanistic interpretability for steering VLA models, which introduces a framework for interpreting and steering these models via their internal representations, and F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline, yielding substantial gains in task success rate and generalization ability.
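To make the idea of steering a model through its internal representations concrete, the sketch below shows a generic activation-level intervention: a steering vector is added to the hidden states of one transformer block via a forward hook at inference time. This is a minimal illustration of the general technique, not the specific framework from the cited papers; the backbone (GPT-2 as a stand-in for a VLA policy's language backbone), the layer index, the steering strength, and the random steering vector are all assumptions for demonstration.

```python
# Minimal sketch of activation steering at inference time (illustrative only).
# Assumptions: a HuggingFace-style causal LM, an arbitrary layer index and
# steering strength, and a random steering vector. In practice the vector
# would be derived from the model's own internal representations, e.g. a
# difference of mean activations between two behaviors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in backbone, not a VLA model from the papers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # which transformer block to intervene on (assumption)
alpha = 4.0     # steering strength (assumption)

hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size)
steering_vector = steering_vector / steering_vector.norm()


def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states
    # of shape (batch, sequence, hidden_size).
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vector.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden


# Register the intervention on one block; remove it to restore the model.
handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

prompt = "The robot should"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()
```

The same pattern (hook, add a direction in activation space, generate) is what makes this kind of steering attractive: it changes behavior at inference time without retraining or modifying weights.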

Sources

Mechanistic interpretability for steering vision-language-action models

Planning with Reasoning using Vision Language World Model

Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving

SAM-LLM: Interpretable Lane Change Trajectory Prediction via Parametric Finetuning

Evaluation of Large Language Models for Anomaly Detection in Autonomous Vehicles

Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
