Multimodal reasoning is moving toward more dynamic and interactive approaches that integrate perception, reasoning, and action. Recent work reports that neural-symbolic frameworks can improve stepwise reasoning accuracy and problem-solving success over static perception methods. Surrogate tasks such as geometric problem-solving have also been shown to enhance spatial perception and reasoning in vision-language models. Compositional reasoning has emerged as another key research area, with neuro-symbolic frameworks and visual grounded reasoning approaches showing promising results. Noteworthy papers include GeoSketch, which introduces a neural-symbolic framework for geometric multimodal reasoning, and Euclid's Gift, which proposes a geometric surrogate task to enhance spatial perception and reasoning. In addition, NePTune presents a neuro-symbolic framework for tunable compositional reasoning, and Logo-VGR introduces a visual grounded reasoning approach for open-world logo recognition. Together, these advances could improve the performance of multimodal large language models across a range of tasks and applications.