Advances in Multimodal Reasoning for Autonomous Driving

The field of autonomous driving is seeing significant advances in multimodal reasoning, with a focus on developing more robust and explainable models. Recent work emphasizes the importance of spatio-temporal reasoning, physical awareness, and chain-of-thought processes for improving the performance of large language models (LLMs) and multimodal large language models (MLLMs) in complex driving environments. Notably, new benchmarks have been introduced to evaluate the holistic understanding of vision-language models, including their ability to reason about ego-vehicle actions and interactions among traffic participants. In addition, research has explored using audio and sound to teach LLMs physical awareness, enabling them to reason about real-world physical phenomena. Together, these developments are pushing the field toward more advanced, human-like decision-making in autonomous driving. Noteworthy papers include SAVVY, which proposes a training-free reasoning pipeline for 3D spatial reasoning in dynamic scenes; STSBench, which introduces a scenario-based framework for benchmarking the spatio-temporal reasoning capabilities of vision-language models; and AD^2-Bench, which focuses on chain-of-thought reasoning in autonomous driving under adverse conditions.

Sources

SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Teaching Physical Awareness to LLMs through Sounds

AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions
