Visual Grounding in GUI and Small Object Localization

The field of visual grounding is moving towards more accurate and reliable methods for mapping natural-language instructions to pixel coordinates. Recent innovations have focused on improved spatial encoding, explicit position-to-coordinate mapping, and adaptive iterative focus refinement. These advances have yielded significant gains in grounding accuracy, particularly on high-resolution interfaces and in small-object localization. Notable papers include:

  • One that introduces RULER tokens and Interleaved MRoPE to improve GUI grounding accuracy.
  • Another that proposes a progressive-iterative zooming adapter for localizing small objects in driving scenarios.
  • A third that presents UGround, a unified visual grounding paradigm with dynamic intermediate layer selection.
  • A fourth that introduces GUI-Spotlight, a model for adaptive iterative focus refinement in GUI visual grounding.
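Two of the ideas above, explicit position-to-coordinate mapping and iterative focus refinement, can be combined in a simple loop: a grounding model predicts a point in view-relative normalized coordinates, the prediction is mapped back to absolute pixels, and the view is zoomed around that point before re-querying the model. The sketch below is illustrative only and not any paper's actual method; `refine_point` and the stub predictor `noisy_predict` are hypothetical names, and the stub stands in for real model inference.

```python
def refine_point(predict, width, height, steps=3, zoom=0.5):
    """Iteratively zoom toward a predicted point.

    `predict(x0, y0, w, h)` is assumed to return (nx, ny) in [0, 1]^2,
    the target position relative to the current view (x0, y0, w, h).
    """
    x0, y0, w, h = 0.0, 0.0, float(width), float(height)
    for _ in range(steps):
        nx, ny = predict(x0, y0, w, h)
        # Explicit position-to-coordinate mapping: view-relative -> pixels.
        px, py = x0 + nx * w, y0 + ny * h
        # Shrink the view around the predicted point, clamped to the screen.
        w, h = w * zoom, h * zoom
        x0 = min(max(px - w / 2, 0.0), width - w)
        y0 = min(max(py - h / 2, 0.0), height - h)
    nx, ny = predict(x0, y0, w, h)
    return x0 + nx * w, y0 + ny * h


TARGET = (300.0, 200.0)  # ground-truth pixel location (illustrative)

def noisy_predict(x0, y0, w, h):
    # Stub model: a fixed 5% error in view-relative coordinates, so the
    # pixel error shrinks as the view zooms in on the target.
    return (TARGET[0] - x0) / w + 0.05, (TARGET[1] - y0) / h + 0.05


x, y = refine_point(noisy_predict, 1000, 800)  # -> (306.25, 205.0)
```

Under this error model, a one-shot prediction on the full 1000x800 screen misses by (50, 40) pixels, while three zoom steps reduce the miss to about (6, 5) pixels, which is the intuition behind progressive zooming for small targets.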

Sources

Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Referring Expression Comprehension for Small Objects

UGround: Towards Unified Visual Grounding with Unrolled Transformers

GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding