The field of visual grounding is moving toward more accurate and reliable methods for mapping natural-language instructions to pixel coordinates. Recent innovations focus on improved spatial encoding, explicit position-to-coordinate mapping, and adaptive iterative focus refinement, yielding notable gains in grounding accuracy, particularly on high-resolution interfaces and in small-object localization. Notable papers include:
- One introduces RULER tokens and Interleaved MRoPE to improve GUI grounding accuracy.
- Another proposes a progressive-iterative zooming adapter for localizing small objects in driving scenarios.
- A third presents UGround, a unified visual grounding paradigm with dynamic intermediate-layer selection.
- A fourth introduces GUI-Spotlight, a model that applies adaptive iterative focus refinement to GUI visual grounding.