The field of multimodal perception and reasoning is moving toward a more comprehensive and nuanced understanding of complex scenarios. Researchers are developing frameworks and datasets that support multiple critical perception tasks, such as object identification, reference resolution, and next-action prediction. There is growing recognition of the importance of rich, context-sensitive attribute annotations for advancing robot perception in dynamic environments. The community is also working to address multimodal imbalance, in which models over-rely on a dominant modality and therefore underperform on tasks that require genuine visual reasoning.

Noteworthy papers in this area include:

- J-ORA: a multimodal dataset that addresses a gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios.
- Quantifying Multimodal Imbalance: a method for the quantitative analysis of multimodal imbalance, which informs the design of a sample-level adaptive loss function (a sketch of this general idea follows the list).
- TowerVision: a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+.
- SeeingEye: a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator (see the pipeline sketch below).
- Unveiling Intrinsic Text Bias: a study showing that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors (a simple diagnostic in this spirit is sketched at the end).
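To make the notion of a sample-level adaptive loss concrete, here is a minimal PyTorch sketch of one way such a loss could be built. This is not the formulation from Quantifying Multimodal Imbalance: the imbalance measure (the gap between unimodal cross-entropy losses), the tanh-based weighting, and all function names are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def imbalance_scores(logits_vision, logits_text, labels):
    """Per-sample imbalance: gap between the two unimodal cross-entropy losses.
    A large positive value means the visual branch is much weaker than the
    textual branch on that sample. (Hypothetical measure, for illustration.)"""
    loss_v = F.cross_entropy(logits_vision, labels, reduction="none")
    loss_t = F.cross_entropy(logits_text, labels, reduction="none")
    return loss_v - loss_t  # shape: (batch,)

def sample_adaptive_loss(logits_fused, logits_vision, logits_text, labels, alpha=1.0):
    """Sample-level adaptive loss: samples whose modalities disagree more
    receive a larger weight, so training focuses on examples that the
    dominant modality alone cannot solve."""
    base = F.cross_entropy(logits_fused, labels, reduction="none")
    gap = imbalance_scores(logits_vision, logits_text, labels).abs()
    weights = 1.0 + alpha * torch.tanh(gap)              # bounded per-sample weights
    weights = weights / weights.mean().clamp_min(1e-8)   # keep the overall loss scale stable
    return (weights.detach() * base).mean()              # detach: weighting is not trained
```

Detaching the weights keeps the weighting term out of the gradient path, so the unimodal heads are trained only by their own objectives; whether that is the right design choice depends on the actual method.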
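The "small VLM as translator for a text-only LLM" idea behind SeeingEye can be illustrated with a short pipeline sketch. This is not SeeingEye's actual architecture or API; the interfaces, prompts, and the optional follow-up loop are assumptions meant only to show how an agent-style translator could sit between an image and a text-only reasoner.

```python
from dataclasses import dataclass
from typing import Callable

# Two model interfaces are assumed: a small VLM that turns an image plus an
# instruction into text, and a text-only LLM that answers from text alone.
# Both are placeholders; any local or hosted model could back them.
VisionTranslator = Callable[[bytes, str], str]   # (image_bytes, instruction) -> description
TextReasoner = Callable[[str], str]              # (prompt) -> answer

@dataclass
class SeeingEyeStylePipeline:
    """Agent-style loop: the text-only LLM may ask the small VLM follow-up
    questions about the image before committing to a final answer."""
    translator: VisionTranslator
    reasoner: TextReasoner
    max_rounds: int = 2

    def answer(self, image: bytes, question: str) -> str:
        # Round 0: a generic caption of the image.
        notes = [self.translator(image, "Describe the image in detail.")]
        for _ in range(self.max_rounds):
            # Let the text-only model decide which visual detail it still needs.
            probe = self.reasoner(
                "Question: " + question + "\n"
                "Visual notes so far:\n" + "\n".join(notes) + "\n"
                "If you need another visual detail, reply with a single question "
                "starting with 'LOOK:'; otherwise reply with 'ANSWER:' and your answer."
            )
            if probe.startswith("ANSWER:"):
                return probe[len("ANSWER:"):].strip()
            if probe.startswith("LOOK:"):
                notes.append(self.translator(image, probe[len("LOOK:"):].strip()))
            else:
                break
        # Fall back to a direct answer from the accumulated notes.
        return self.reasoner(
            "Question: " + question + "\nVisual notes:\n" + "\n".join(notes) +
            "\nGive the final answer."
        )
```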
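Finally, the kind of key-space analysis alluded to in Unveiling Intrinsic Text Bias could be probed with a simple diagnostic like the one below. This is a generic sketch, not the paper's analysis: the statistics reported (attention mass on text-token keys and the cosine between mean text-key and mean image-key directions) and all names are hypothetical.

```python
import torch

def key_space_bias_report(keys: torch.Tensor, queries: torch.Tensor,
                          is_text: torch.Tensor) -> dict:
    """Hypothetical diagnostic for one attention head of one layer.

    keys, queries: (seq_len, d) tensors taken from a multimodal transformer.
    is_text: (seq_len,) boolean mask, True at text-token positions.
    Returns how attention mass splits between text and image keys, plus the
    cosine between the mean text-key and mean image-key directions, a crude
    proxy for key-space misalignment."""
    scores = queries @ keys.T / keys.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)                          # (seq_len, seq_len)
    text_mass = attn[:, is_text].sum(dim=-1).mean().item()
    mean_text_key = keys[is_text].mean(dim=0)
    mean_image_key = keys[~is_text].mean(dim=0)
    cos = torch.nn.functional.cosine_similarity(
        mean_text_key, mean_image_key, dim=0).item()
    return {"attention_mass_on_text": text_mass,
            "attention_mass_on_image": 1.0 - text_mass,
            "text_image_key_cosine": cos}
```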