Vision-Language-Action Models for Robotic Manipulation

The field of robotic manipulation is advancing rapidly with the development of Vision-Language-Action (VLA) models, which learn complex behaviors from large-scale, multi-modal datasets and have shown promising results across a wide range of manipulation tasks. Current work focuses on improving robustness, generalization, and efficiency, with researchers exploring new architectures, training methods, and inference techniques to make VLA models practical in real-world settings. Notable papers include SwiftVLA, a lightweight VLA model that achieves performance comparable to larger models and state-of-the-art results on edge devices while running 18 times faster with a 12 times smaller memory footprint, and VLASH, an asynchronous inference framework that delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes. Overall, the field is evolving quickly, and further innovations and improvements can be expected in the coming years.
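
To make the asynchronous-inference idea concrete, the sketch below shows the general pattern such frameworks build on: the robot keeps executing actions from the most recent chunk at a fixed control rate while the next chunk is computed in a background thread, so control never stalls on model latency. This is only an illustrative sketch, not the VLASH implementation; `VLAPolicy`-style `policy(obs)` and the `robot.observe()` / `robot.apply()` interface are hypothetical placeholders.

```python
# Minimal sketch of asynchronous action-chunk execution for a VLA policy.
# `policy` and `robot` are hypothetical placeholders for illustration only;
# this is not the VLASH method, just the generic pattern of overlapping
# slow model inference with fixed-rate execution of the current chunk.

import queue
import threading
import time


class AsyncVLARunner:
    def __init__(self, policy, robot, control_hz=30):
        self.policy = policy          # assumed: policy(obs) -> list of actions (a "chunk")
        self.robot = robot            # assumed: robot.observe() -> obs, robot.apply(action)
        self.dt = 1.0 / control_hz
        self.chunks = queue.Queue(maxsize=1)
        self._stop = threading.Event()

    def _inference_loop(self):
        # Background thread: repeatedly compute the next chunk from the latest observation.
        while not self._stop.is_set():
            obs = self.robot.observe()
            chunk = self.policy(obs)          # slow forward pass
            if self.chunks.full():            # keep only the freshest chunk
                try:
                    self.chunks.get_nowait()
                except queue.Empty:
                    pass
            self.chunks.put(chunk)

    def run(self, duration_s=10.0):
        threading.Thread(target=self._inference_loop, daemon=True).start()
        current = self.chunks.get()           # block once for the first chunk
        deadline = time.monotonic() + duration_s
        i = 0
        while time.monotonic() < deadline:
            try:
                current, i = self.chunks.get_nowait(), 0   # switch to a fresher chunk if ready
            except queue.Empty:
                pass                                       # otherwise continue the current chunk
            self.robot.apply(current[min(i, len(current) - 1)])
            i += 1
            time.sleep(self.dt)               # fixed-rate control loop
        self._stop.set()
```

VLASH additionally predicts from an estimated future state to compensate for the latency this pattern introduces; that compensation step is omitted here.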

Sources

Sample-Efficient Expert Query Control in Active Imitation Learning via Conformal Prediction

Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

MM-ACT: Learn from Multimodal Parallel Generation to Act

CycleManip: Enabling Cyclic Task Manipulation via Effective Historical Perception and Understanding

VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

Real-World Reinforcement Learning of Active Perception Behaviors

LLM2Fx-Tools: Tool Calling For Music Post-Production

DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models

GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models

ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

DAWZY: A New Addition to AI powered "Human in the Loop" Music Co-creation

GOMP: Grasped Object Manifold Projection for Multimodal Imitation Learning of Manipulation

Active Visual Perception: Opportunities and Challenges

PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention

Hierarchical Vision Language Action Model Using Success and Failure Demonstrations

Vision-Language-Action Models for Selective Robotic Disassembly: A Case Study on Critical Component Extraction from Desktops

MOVE: A Simple Motion-Based Data Collection Paradigm for Spatial Generalization in Robotic Manipulation

FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models
