Advances in Failure Analysis and Localization

Research on failure analysis and localization is moving quickly, driven by the growing complexity of modern microservice systems and distributed technologies. Recent work emphasizes adaptability, cost-effectiveness, and accuracy in failure localization. Notable advances include reinforcement fine-tuning that equips lightweight language models with self-refinement capabilities, and the integration of multi-modality observation data to overcome the limitations of traditional approaches. Interest is also growing in automated failure attribution for multi-agent systems, which aims to identify the agent and the step responsible for a task failure. Despite this progress, challenges remain open in each of these directions.

Noteworthy papers include:

Which Agent Causes Task Failures and When? proposes automated failure attribution as a new research area and introduces the Who&When dataset.

ThinkFL presents a progressive multi-stage fine-tuning framework for self-refining failure localization.

TAMO introduces a tool-assisted LLM agent that uses multi-modality observation data for fine-grained root cause analysis.
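To make the failure attribution task concrete, the sketch below treats each attribution as an (agent, step) pair and scores predictions at both levels. It is a minimal illustration under stated assumptions, not the evaluation code of any paper listed here: the `Attribution` class, the `attribution_accuracy` function, and the toy labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical label: the agent blamed for a failure and the step where it erred."""
    agent: str
    step: int


def attribution_accuracy(predicted, actual):
    """Return (agent-level accuracy, step-level accuracy) over paired labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and the same length")
    pairs = list(zip(predicted, actual))
    # Agent-level: did we blame the right agent?
    agent_hits = sum(p.agent == a.agent for p, a in pairs)
    # Step-level: did we point at the right step in the log?
    step_hits = sum(p.step == a.step for p, a in pairs)
    return agent_hits / len(actual), step_hits / len(actual)


# Toy labels, invented for illustration only.
actual = [Attribution("planner", 3), Attribution("coder", 7)]
predicted = [Attribution("planner", 3), Attribution("coder", 5)]

agent_acc, step_acc = attribution_accuracy(predicted, actual)
print(agent_acc, step_acc)  # 1.0 0.5
```

In this toy case both agents are identified correctly, but only one of the two steps is, which shows why pinpointing the step is the harder half of the task.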

Sources

Why Does My Transaction Fail? A First Look at Failed Transactions on the Solana Blockchain

ThinkFL: Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning

TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
