Multimodal Understanding and Embodied Perception

The field of multimodal understanding is moving toward more efficient, fine-grained models that can reliably capture the visual regions relevant to a textual prompt. Recent developments target the limitations of existing methods, most notably high computational cost and imprecise visual grounding. Noteworthy papers in this area include Viper-F1, a hybrid state-space vision-language model that replaces attention with efficient liquid state-space dynamics, and LIHE, a linguistic instance-split hyperbolic-Euclidean framework for generalized weakly-supervised referring expression comprehension. On the embodied-perception side, EyeVLA introduces a robotic eyeball for active visual perception, acquiring more informative observations within pixel and spatial budget constraints. Other notable works include AGREE, which brings attention-grounded enhancement to visual document retrieval, and C2F-Space, which performs coarse-to-fine space grounding for spatial instructions using vision-language models.
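The efficiency argument behind replacing attention with state-space dynamics is that a state-space recurrence mixes tokens in time linear in sequence length, whereas attention is quadratic. The sketch below illustrates that general idea with a generic diagonal state-space token mixer; it is not Viper-F1's actual architecture, and all class and parameter names (SimpleSSMMixer, decay_logit, input_gain) are hypothetical.

```python
# Minimal sketch: a diagonal state-space "token mixer" standing in for
# self-attention as the sequence-mixing layer. Generic illustration only,
# not Viper-F1's architecture; all names here are hypothetical.
import torch
import torch.nn as nn


class SimpleSSMMixer(nn.Module):
    """Mixes tokens with a per-channel recurrence h_t = a * h_{t-1} + b * x_t,
    giving O(seq_len) sequence mixing instead of attention's O(seq_len^2)."""

    def __init__(self, dim: int):
        super().__init__()
        # Learnable per-channel decay (kept in (0, 1) via sigmoid) and input gain.
        self.decay_logit = nn.Parameter(torch.zeros(dim))
        self.input_gain = nn.Parameter(torch.ones(dim))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. fused vision-text tokens.
        a = torch.sigmoid(self.decay_logit)   # per-channel decay, shape (dim,)
        b = self.input_gain                    # per-channel input gain, shape (dim,)
        h = torch.zeros(x.size(0), x.size(2), device=x.device, dtype=x.dtype)
        outputs = []
        for t in range(x.size(1)):             # sequential scan over tokens
            h = a * h + b * x[:, t, :]
            outputs.append(h)
        y = torch.stack(outputs, dim=1)        # (batch, seq_len, dim)
        return self.out_proj(y)


if __name__ == "__main__":
    mixer = SimpleSSMMixer(dim=64)
    tokens = torch.randn(2, 16, 64)
    print(mixer(tokens).shape)                 # torch.Size([2, 16, 64])
```

The explicit Python loop is written for clarity; practical state-space layers compute the same recurrence with a parallel scan or convolutional formulation to realize the speed advantage over attention.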

Sources

Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation

Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

Attention Grounded Enhancement for Visual Document Retrieval

Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models
