The field of embodied intelligence is moving toward integrating high-level reasoning with low-level control, with a focus on developing scalable and robust models. Recent work has highlighted the potential of vision-language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the knowledge and skills needed to succeed. To bridge this gap, researchers are exploring frameworks that combine prior knowledge learning with online reinforcement learning. Notable papers include:
- Vlaser, which achieves state-of-the-art performance across a range of embodied reasoning benchmarks.
- EmboMatrix, which provides a comprehensive infrastructure for training large language models to acquire genuine embodied decision-making skills.
- ERA, which offers a practical path toward scalable embodied intelligence by integrating embodied prior learning and online reinforcement learning.
- RoboGPT-R1, which enhances robot planning with reinforcement learning and outperforms larger-scale models on the EmbodiedBench benchmark.
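The two-stage recipe that several of these works share (e.g. ERA) — first instill an embodied prior from demonstrations, then fine-tune with online reinforcement learning — can be sketched on a toy problem. The sketch below is purely illustrative: the two-armed bandit environment, the single-logit policy, and all function names are assumptions for exposition, not taken from any of the papers above.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Policy:
    """Bernoulli policy over two actions, parameterised by a single logit."""
    def __init__(self):
        self.theta = 0.0

    def prob_a1(self):
        return sigmoid(self.theta)

    def sample(self):
        return 1 if random.random() < self.prob_a1() else 0

def stage1_prior_learning(policy, expert_actions, lr=0.5):
    # Stage 1 ("embodied prior"): behaviour cloning, i.e. maximise the
    # log-likelihood of expert actions via the Bernoulli gradient (a - p).
    for a in expert_actions:
        p = policy.prob_a1()
        policy.theta += lr * (a - p)

def stage2_online_rl(policy, reward_fn, steps=500, lr=0.2):
    # Stage 2: online RL with REINFORCE; follow the score-function
    # gradient  r * d/dtheta log pi(a) = r * (a - p).
    for _ in range(steps):
        a = policy.sample()
        r = reward_fn(a)
        p = policy.prob_a1()
        policy.theta += lr * r * (a - p)

# Toy environment: action 1 pays reward 1 with prob 0.9, action 0 with prob 0.2.
def reward_fn(action):
    return 1.0 if random.random() < (0.9 if action == 1 else 0.2) else 0.0

policy = Policy()
stage1_prior_learning(policy, expert_actions=[1, 1, 0, 1, 1])  # demos favour action 1
stage2_online_rl(policy, reward_fn)
print(policy.prob_a1())
```

The design point the sketch illustrates is the motivation behind these frameworks: the supervised stage gives the policy a sensible starting point cheaply, so the online RL stage explores from a competent prior rather than from scratch.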