Advances in Artificial Intelligence Safety and Intelligence Measurement

The field of artificial intelligence is moving toward systems that are both more capable and safer. Researchers are exploring new ways to measure intelligence, with a focus on predictive intelligence as a universal measure applicable to humans, animals, and AI systems alike. Another significant direction is the development of safety principles and benchmarks that test whether AI systems adhere to predefined safety-critical principles even when these conflict with operational goals. The theoretical limits of predicting an agent's behavior from its interactions with the environment are also being investigated, clarifying what can and cannot be inferred about intentional agents from behavioral data alone. Further work designs foundation models that prioritize human control and empowerment, countering the default trajectory toward misaligned instrumental convergence. The design of algorithmic delegates that work efficiently with humans is another active area, with a focus on optimal delegates for a variety of decision-making tasks. Lastly, hybrid frameworks that integrate explainability, model checking, and risk-guided falsification are being proposed to verify the safety of reinforcement learning policies in high-stakes environments. Noteworthy papers include:

  • A Universal Measure of Predictive Intelligence, which proposes a new universal measure of intelligence based on predictive accuracy and complexity.
  • Corrigibility as a Singular Target, which presents a comprehensive empirical research agenda for designing foundation models that prioritize human control and empowerment.
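The first paper's combination of predictive accuracy and complexity can be illustrated with a toy sketch. The paper's exact formulation is not reproduced here; this is a minimal illustrative version in the spirit of complexity-weighted intelligence measures, assuming a score that sums next-symbol prediction accuracy across environments, with each environment weighted by an Occam-style prior derived from its description complexity (approximated here by compressed length; the `/ 100` scaling is an arbitrary choice for the toy).

```python
import zlib


def complexity_bits(description: str) -> int:
    """Proxy for description complexity: compressed length in bits."""
    return 8 * len(zlib.compress(description.encode()))


def predictive_accuracy(predictor, sequence) -> float:
    """Fraction of next-symbol predictions the predictor gets right."""
    correct = sum(
        1 for i in range(1, len(sequence)) if predictor(sequence[:i]) == sequence[i]
    )
    return correct / (len(sequence) - 1)


def intelligence_score(predictor, environments) -> float:
    """Sum of predictive accuracy over environments, each weighted by
    2^(-complexity): simpler environments count more (Occam prior).
    The scaling factor is arbitrary and chosen only for this toy."""
    total = 0.0
    for description, sequence in environments:
        weight = 2.0 ** (-complexity_bits(description) / 100)
        total += weight * predictive_accuracy(predictor, sequence)
    return total


# Two toy environments: a constant stream and an alternating stream.
envs = [
    ("constant zeros", [0] * 20),
    ("alternating 0 and 1", [i % 2 for i in range(20)]),
]

# A predictor that repeats the last symbol: perfect on the constant
# environment, always wrong on the alternating one.
repeat_last = lambda history: history[-1]
print(intelligence_score(repeat_last, envs))
```

A predictor that also modeled the alternating environment would score strictly higher, which is the intended behavior of such a measure: broader predictive competence across simply-describable environments yields a higher score.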

Sources

A Universal Measure of Predictive Intelligence

Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components

The Limits of Predicting Agents from Behaviour

Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models

Designing Algorithmic Delegates: The Role of Indistinguishability in Human-AI Handoff

Verification-Guided Falsification for Safe RL via Explainable Abstraction and Risk-Aware Exploration
