The field of computer-using agents is moving towards more comprehensive and realistic benchmarks for evaluating agent capabilities. Existing benchmarks rarely account for the heterogeneity of tasks or the distinct capabilities each task demands, which hinders the development of more advanced agents. Researchers are therefore building benchmarks that organize tasks along key dimensions, such as automation level and generalization scope, enabling fine-grained analysis of required capabilities and closer alignment with real-world scenarios. This is yielding a deeper understanding of the strengths and limitations of current agents and driving progress in the field. Noteworthy papers include OS-MAP, which presents a benchmark for daily computer-using automation, and UI-AGILE, which introduces a comprehensive framework for enhancing GUI agents with effective reinforcement learning and precise inference-time grounding. The Phi-Ground Tech Report also presents a state-of-the-art model for GUI grounding, achieving high accuracy on challenging benchmarks.
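To make the idea of organizing tasks along such dimensions concrete, the minimal Python sketch below tags each benchmark task with an automation level and a generalization scope and aggregates success rates per cell. The level and scope names, the `Task` fields, and `capability_breakdown` are illustrative assumptions, not the actual schema of OS-MAP or the other cited papers.

```python
from dataclasses import dataclass
from enum import Enum
from collections import defaultdict

class AutomationLevel(Enum):
    # Illustrative levels only; not OS-MAP's actual definitions.
    ASSISTED = 1
    PARTIAL = 2
    CONDITIONAL = 3
    FULL = 4

class GeneralizationScope(Enum):
    # Hypothetical scopes describing how far a solution must transfer.
    SINGLE_APP = "single_app"
    CROSS_APP = "cross_app"
    CROSS_DOMAIN = "cross_domain"

@dataclass
class Task:
    name: str
    automation: AutomationLevel
    generalization: GeneralizationScope
    success: bool  # outcome of one agent run on this task

def capability_breakdown(results: list[Task]) -> dict[tuple, float]:
    """Success rate per (automation level, generalization scope) cell."""
    totals, wins = defaultdict(int), defaultdict(int)
    for t in results:
        key = (t.automation.name, t.generalization.value)
        totals[key] += 1
        wins[key] += int(t.success)
    return {k: wins[k] / totals[k] for k in totals}

if __name__ == "__main__":
    runs = [
        Task("rename a file", AutomationLevel.FULL,
             GeneralizationScope.SINGLE_APP, True),
        Task("book travel across sites", AutomationLevel.CONDITIONAL,
             GeneralizationScope.CROSS_APP, False),
    ]
    for cell, rate in capability_breakdown(runs).items():
        print(cell, f"{rate:.0%}")
```

A per-cell breakdown like this is what enables the fine-grained capability analysis the paragraph describes: rather than a single aggregate score, an agent's performance can be compared across automation levels and generalization scopes.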