Robotics is advancing rapidly through the integration of vision-language models, which enable robots to better understand and interact with their environments. Recent work has focused on learning from multimodal feedback, such as language and vision, and on generating actions conditioned on that feedback, yielding significant progress in robotic manipulation, navigation, and scene understanding. Foundation models and large language models have been central to these advances.
Some noteworthy papers in this area include:
- PRIMT, which introduces a preference-based reinforcement learning framework that leverages foundation models for multimodal synthetic feedback and trajectory synthesis.
- OmniVLA, which presents a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation.
- VLN-Zero, which proposes a two-phase vision-language navigation framework that uses vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation.
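To make the preference-learning idea behind frameworks like PRIMT concrete, the sketch below trains a small reward model from synthetic pairwise trajectory preferences with a standard Bradley-Terry objective. It is a minimal illustration, not the paper's actual method or API: the `RewardModel` class, the placeholder `synthetic_preference` judge, and the random trajectory features are all assumptions; in a real system the preference labels would come from a foundation model scoring multimodal rollouts.

```python
# Minimal sketch: learning a reward model from synthetic pairwise preferences.
# All names and shapes here are hypothetical, for illustration only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps per-step trajectory features to a scalar trajectory reward."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, traj_feats: torch.Tensor) -> torch.Tensor:
        # Sum per-step rewards over the trajectory: (B, T, F) -> (B,)
        return self.net(traj_feats).squeeze(-1).sum(dim=-1)

def synthetic_preference(traj_a: torch.Tensor, traj_b: torch.Tensor) -> torch.Tensor:
    """Stand-in for a foundation-model judge: prefers the trajectory with the
    larger mean feature value (a placeholder heuristic, not the real signal)."""
    return (traj_a.mean(dim=(1, 2)) > traj_b.mean(dim=(1, 2))).float()

def bradley_terry_loss(r_a: torch.Tensor, r_b: torch.Tensor, pref_a: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: models P(a preferred over b) = sigmoid(r_a - r_b)."""
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, pref_a)

if __name__ == "__main__":
    feat_dim, horizon, batch = 8, 20, 32
    model = RewardModel(feat_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(200):
        # Random trajectory pairs stand in for policy rollouts.
        traj_a = torch.randn(batch, horizon, feat_dim)
        traj_b = torch.randn(batch, horizon, feat_dim)
        pref_a = synthetic_preference(traj_a, traj_b)

        loss = bradley_terry_loss(model(traj_a), model(traj_b), pref_a)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The learned reward model would then supply the reward signal for a downstream reinforcement-learning loop; the key design choice in this family of methods is replacing costly human preference labels with synthetic feedback from foundation models.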