The field of multimodal understanding is moving toward more efficient, fine-grained models that can capture the visual regions relevant to a textual prompt. Recent work targets the main limitations of existing methods: high computational cost and imprecise visual grounding. Noteworthy papers include Viper-F1, which introduces a hybrid state-space vision-language model that replaces attention with efficient liquid state-space dynamics, and LIHE, which proposes a linguistic instance-split hyperbolic-Euclidean framework for generalized weakly-supervised referring expression comprehension. In embodied perception, EyeVLA presents a robotic vision system that performs active visual perception, acquiring more informative observations within pixel and spatial budget constraints. AGREE and C2F-Space contribute, respectively, to visual document retrieval and to space grounding for spatial instructions using vision-language models.
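To make the contrast with attention concrete, the sketch below shows a generic linear-time state-space token mixer of the kind such models build on: a per-channel recurrence replaces the quadratic token-to-token interaction of self-attention. The function, parameter names, shapes, and diagonal recurrence here are illustrative assumptions, not Viper-F1's published architecture.

```python
# Minimal sketch (not Viper-F1's actual implementation) of replacing
# quadratic-cost attention with a linear-time state-space token mixer.
# All shapes and parameter names are illustrative assumptions.
import numpy as np

def ssm_token_mixer(x, a, b, c, d):
    """Per-channel linear state-space scan over a token sequence.

    x : (L, D) token embeddings (L tokens, D channels)
    a : (D,)   per-channel state decay, |a| < 1 for stability
    b : (D,)   input projection into the state
    c : (D,)   state readout
    d : (D,)   skip (direct) path
    Returns (L, D) mixed tokens in O(L * D) time, versus
    O(L^2 * D) for full self-attention over the same sequence.
    """
    L, D = x.shape
    h = np.zeros(D)           # hidden state carried along the sequence
    y = np.empty_like(x)
    for t in range(L):
        h = a * h + b * x[t]  # recurrent state update (the "state-space dynamics")
        y[t] = c * h + d * x[t]
    return y

# Toy usage: mix a short sequence of fused vision-language tokens.
rng = np.random.default_rng(0)
L, D = 16, 8                          # sequence length and width are arbitrary
tokens = rng.normal(size=(L, D))
a = 0.9 * np.ones(D)                  # slowly decaying memory per channel
b = c = d = 0.5 * np.ones(D)
mixed = ssm_token_mixer(tokens, a, b, c, d)
print(mixed.shape)                    # (16, 8)
```

The efficiency claim in the paragraph above comes from exactly this asymmetry: the recurrence touches each token once, so cost grows linearly with sequence length rather than quadratically as in self-attention.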