The field of visual grounding is moving toward more accurate and reliable methods for mapping natural-language instructions to pixel coordinates. Recent innovations focus on improved spatial encoding, explicit position-to-coordinate mapping, and adaptive iterative focus refinement, yielding notable gains in grounding accuracy, particularly on high-resolution interfaces and in small-object localization. Notable papers include:
- One introduces RULER tokens and Interleaved MRoPE to improve GUI grounding accuracy.
- Another proposes a progressive-iterative zooming adapter for localizing small objects in driving scenarios.
- A third presents UGround, a unified visual grounding paradigm with dynamic intermediate-layer selection.
- A fourth introduces GUI-Spotlight, a model that applies adaptive iterative focus refinement to GUI visual grounding.