The field of deep research agents is moving toward personalized, autonomous systems that can conduct complex investigations and generate comprehensive reports. Recent work has focused on benchmarks and evaluation frameworks that assess these systems in open-ended, realistic scenarios, highlighting the importance of personalization, content quality, and factual reliability. Multidimensional evaluation frameworks have also enabled comprehensive assessment of the long-form reports these agents generate; a minimal sketch of such a scoring scheme follows the list below. Notable papers in this area include:
- Towards Personalized Deep Research: Benchmarks and Evaluations, which introduced the Personalized Deep Research Bench, a benchmark for evaluating personalization in deep research agents.
- DRBench: A Realistic Benchmark for Enterprise Deep Research, which evaluates AI agents on complex, open-ended deep research tasks in enterprise settings.
- A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports, which pairs a rigorous benchmark with a multidimensional evaluation framework tailored to report-style responses.
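To make the idea of multidimensional evaluation concrete, here is a minimal Python sketch of how per-dimension scores for a generated report might be aggregated into a single weighted score. The dimension names, weights, and aggregation rule are illustrative assumptions, not taken from any of the benchmarks above, each of which defines its own rubric and scoring protocol.

```python
# Hypothetical dimensions and weights, for illustration only;
# real benchmarks define their own rubrics and scoring protocols.
DIMENSION_WEIGHTS = {
    "personalization": 0.25,
    "content_quality": 0.35,
    "factual_reliability": 0.40,
}

def aggregate_report_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted overall score."""
    missing = set(DIMENSION_WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing scores for dimensions: {sorted(missing)}")
    return sum(w * dimension_scores[d] for d, w in DIMENSION_WEIGHTS.items())

if __name__ == "__main__":
    # Scores a judge (human or LLM) might assign to one generated report.
    scores = {
        "personalization": 0.8,
        "content_quality": 0.7,
        "factual_reliability": 0.9,
    }
    print(f"overall report score: {aggregate_report_score(scores):.3f}")
```

Weighted aggregation is only one possible design choice; frameworks in this space may instead report per-dimension scores separately or use rubric-based judging rather than numeric averaging.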