Advancements in Autonomous GUI Interaction

Research on autonomous GUI interaction is advancing quickly, with a focus on agents that can interact efficiently with complex graphical user interfaces. Recent work explores experience-driven learning frameworks, scalable frameworks for automated desktop UI exploration, and relational reinforcement learning as ways to improve agent performance. Notably, hybrid action mechanisms and foundation models now let agents combine low-level GUI primitives with high-level programmatic tool calls, which is reported to improve both exploration efficiency and strategic depth. Researchers have also proposed methods for resolving instruction ambiguity and for improving GUI grounding. Noteworthy papers in this area include Experience-Driven Exploration for Efficient API-Free AI Agents, which proposes a framework that structures an agent's raw pixel-level interactions into a persistent State-Action Knowledge Graph, and UI-Ins, which introduces the multi-perspective Instruction-as-Reasoning paradigm to enhance GUI grounding. Together, these advances could transform desktop automation and support the development of more robust and secure embodied agents.
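To make the State-Action Knowledge Graph idea mentioned above more concrete, here is a minimal Python sketch. It is not the paper's implementation: the class `StateActionGraph`, the helper `state_id`, the use of a screenshot hash as a state key, the action strings, and the JSON persistence format are all assumptions made for illustration. It only shows the general shape of recording observed (state, action, next state) transitions so that an agent can reuse them across sessions.

```python
import hashlib
import json
from collections import defaultdict


def state_id(screenshot: bytes) -> str:
    """Hypothetical state key: a short hash of the raw screenshot bytes."""
    return hashlib.sha256(screenshot).hexdigest()[:12]


class StateActionGraph:
    """Toy graph: nodes are screen states, edges are actions and their outcomes."""

    def __init__(self):
        # edges[state][action] -> id of the state observed after the action
        self.edges = defaultdict(dict)

    def record(self, before: bytes, action: str, after: bytes) -> None:
        """Store one observed transition."""
        self.edges[state_id(before)][action] = state_id(after)

    def known_actions(self, screenshot: bytes) -> dict:
        """Return the actions already tried in this state and where they led."""
        return dict(self.edges.get(state_id(screenshot), {}))

    def untried(self, screenshot: bytes, candidates: list) -> list:
        """Filter candidate actions down to those not yet tried in this state."""
        tried = self.edges.get(state_id(screenshot), {})
        return [a for a in candidates if a not in tried]

    def to_json(self) -> str:
        """Serialize the graph so it can persist between runs."""
        return json.dumps(self.edges, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "StateActionGraph":
        """Rebuild a graph from its serialized form."""
        graph = cls()
        for state, actions in json.loads(text).items():
            graph.edges[state] = dict(actions)
        return graph
```

In this sketch an agent would call `record` after every interaction and consult `untried` before choosing its next action, so that already-explored transitions are not repeated. A real system would need a state key that tolerates small pixel differences, which an exact hash does not.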

Sources

Experience-Driven Exploration for Efficient API-Free AI Agents

GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

Human-Allied Relational Reinforcement Learning

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

ColorAgent: Building A Robust, Personalized, and Interactive OS Agent

VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

DAIL: Beyond Task Ambiguity for Language-Conditioned Reinforcement Learning

Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?
