Advances in AI Safety and Robustness

AI research is placing growing emphasis on safety and robustness, focusing on methods that mitigate potential risks and ensure systems behave as intended. Recent work highlights the need to evaluate AI systems more comprehensively, accounting for context, uncertainty, and potential biases. Noteworthy papers include Stress Testing Deliberative Alignment for Anti-Scheming Training, which proposes a framework for assessing anti-scheming interventions and reports that deliberative alignment substantially reduces covert action rates, and Safe-SAIL, which introduces a framework for interpreting sparse autoencoder features in large language models to advance mechanistic understanding in safety domains.
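Safe-SAIL's interpretability angle rests on sparse autoencoders (SAEs) trained over model activations to decompose them into individually inspectable features. The sketch below shows that general technique in PyTorch; it is not the paper's implementation, and the class name, dimensions, and loss weights are illustrative placeholders.

```python
# Minimal sparse autoencoder (SAE) sketch of the kind used to decompose
# LLM hidden states into interpretable features. Illustrates the general
# technique Safe-SAIL builds on, not the paper's implementation; all
# dimensions and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: d_features >> d_model
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative, which pairs
        # naturally with the L1 sparsity penalty below
        feats = torch.relu(self.encoder(x))
        return self.decoder(feats), feats

# Random stand-in for residual-stream activations collected from an LLM
acts = torch.randn(4096, 768)

sae = SparseAutoencoder(d_model=768, d_features=8192)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity pressure; tuned in practice

for step in range(100):
    recon, feats = sae(acts)
    # Reconstruction loss plus an L1 penalty that drives most
    # feature activations to zero on any given input
    loss = nn.functional.mse_loss(recon, acts) + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, individual decoder directions can be inspected and
# labeled, e.g. to search for safety-relevant features.
```

In practice the activations come from a chosen layer of the target model rather than random noise, and interpreting a feature means examining the inputs that most strongly activate it.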
Sources
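Stress Testing Deliberative Alignment for Anti-Scheming Training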
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation