The field of graphical user interface (GUI) agents is moving toward more capable and robust models, with a focus on multi-turn reinforcement learning, page graph-based approaches, and active perception. Researchers are exploring new methodologies to improve the scalability, stability, and generalization of GUI agents so that they can carry out complex tasks across diverse environments. Notable advances include frameworks that integrate page graphs with retrieval-augmented generation, as well as self-evolving preference optimization; these innovations have led to significant improvements in GUI grounding, parsing, and perception. Noteworthy papers include:

- UI-TARS-2, which presents a native GUI-centered agent model that achieves state-of-the-art performance across a range of benchmarks.
- PG-Agent, which introduces a page graph-based approach that captures the complex transition relationships between pages (see the sketch after this list).
- LASER, which proposes a self-evolving framework that enables vision-language models to reason over the appropriate image regions.
- SparkUI-Parser, which strengthens GUI perception with robust grounding and parsing.
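To make the page-graph idea concrete, here is a minimal sketch, not PG-Agent's actual implementation: GUI pages are modeled as nodes, actions as directed edges, and previously observed transitions can be retrieved as guidance for a new task, roughly in the spirit of retrieval-augmented planning. All names here (PageGraph, add_transition, retrieve_paths) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class PageGraph:
    """Directed graph where nodes are GUI pages and edges are actions.

    A hypothetical illustration of the page-graph concept; the real
    PG-Agent pipeline may differ substantially.
    """

    def __init__(self) -> None:
        # page -> list of (action, next_page) transitions observed so far
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_transition(self, page: str, action: str, next_page: str) -> None:
        """Record that performing `action` on `page` leads to `next_page`."""
        self.edges[page].append((action, next_page))

    def retrieve_paths(self, start: str, goal: str, max_depth: int = 4) -> List[List[str]]:
        """Depth-limited search for action sequences from `start` to `goal`,
        which could serve as retrieved guidance for an agent's planner."""
        paths: List[List[str]] = []

        def dfs(page: str, actions: List[str], visited: Set[str]) -> None:
            if page == goal:
                paths.append(actions)
                return
            if len(actions) >= max_depth:
                return
            for action, nxt in self.edges[page]:
                if nxt not in visited:
                    dfs(nxt, actions + [action], visited | {nxt})

        dfs(start, [], {start})
        return paths


# Example: a tiny settings flow on a mobile UI.
graph = PageGraph()
graph.add_transition("home", "tap Settings", "settings")
graph.add_transition("settings", "tap Wi-Fi", "wifi")
graph.add_transition("home", "open notifications", "notifications")
print(graph.retrieve_paths("home", "wifi"))
# [['tap Settings', 'tap Wi-Fi']]
```

In a retrieval-augmented setup, paths like the one printed above would be serialized into the agent's prompt as prior knowledge about how pages connect, rather than executed directly.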