Advancements in Large Language Model Agents
The field of large language model (LLM) agents is advancing rapidly, with a focus on developing more robust, reliable, and generalizable models. Recent research has highlighted the importance of evaluating LLM agents in complex, real-world scenarios, such as ultra-long-horizon tasks, multi-step tool use, and adversarial environments. New benchmarks, including UltraHorizon, SafeSearch, and CAIA, enable researchers to assess agent capabilities in these challenging settings, while frameworks such as QuantMind and Fathom-DeepResearch have improved agent performance on tasks requiring long-horizon information retrieval and synthesis. Noteworthy papers include UltraHorizon, which introduces a benchmark for evaluating agent capabilities in ultra-long-horizon scenarios, and CAIA, which exposes a critical blind spot in AI evaluation by assessing how well state-of-the-art models operate in adversarial, high-stakes environments.
Sources
OpenID Connect for Agents (OIDC-A) 1.0: A Standard Extension for LLM-Based Agent Identity and Authorization
Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets
Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm