Advances in Explainability and Reliability of Complex Systems

Research on complex systems is placing greater emphasis on explainability and reliability. Researchers are developing new methods for identifying causal relationships, attributing failures to their sources, and optimizing system performance. These advances could improve the efficiency and effectiveness of complex systems such as large language models and multi-agent systems.

Notable papers in this area include:

"Online Identification of IT Systems through Active Causal Learning" presents a principled method for online identification of causal models.

"Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training" proposes a lightweight distributed tracing and root cause analysis system.

"AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?" introduces an automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection.

"RAFFLES: Reasoning-based Attribution of Faults for LLM Systems" presents an evaluation architecture that incorporates reasoning and iterative refinement.

"AutoODD: Agentic Audits via Bayesian Red Teaming in Black-Box Models" introduces a framework that automatically generates semantically relevant test cases to search for failure modes in specialized black-box models.

"Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference" introduces a failure attribution framework for multi-agent systems grounded in multi-granularity causal inference.

Sources

An Information-Flow Perspective on Explainability Requirements: Specification and Verification

Online Identification of IT Systems through Active Causal Learning

Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training

FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs

TopoMap: A Feature-based Semantic Discriminator of the Topographical Regions in the Test Input Space

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

AutoODD: Agentic Audits via Bayesian Red Teaming in Black-Box Models

Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference
