Advances in Safe and Reliable Large Language Models

The field of large language models is shifting toward safer, more reliable systems. Recent work highlights the need to handle cross-modal contexts, mitigate modal imbalance, and harden models against adversarial attacks. Studies show that current models are easily jailbroken or over-refuse harmless inputs, and that they often prioritize some modalities over others. To address these limitations, researchers are exploring approaches such as certifiably safe reinforcement learning, survival analysis of conversational robustness, and consequence-aware reasoning. These methods aim to help models reason about the link between actions and outcomes and to produce more trustworthy outputs. Noteworthy papers include: Mitigating Modal Imbalance in Multimodal Reasoning, which demonstrates the importance of addressing cross-modal attention imbalance; Time-To-Inconsistency, which presents a comprehensive survival analysis of conversational AI robustness; and SaFeR-VLM, which proposes a safety-aligned reinforcement learning framework for multimodal models.
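
To make the survival-analysis framing concrete, here is a minimal sketch that applies a standard Kaplan-Meier estimator to hypothetical multi-turn dialogues, treating the first inconsistent turn as the event and dialogues that end while still consistent as right-censored. This is a generic illustration of the technique, not the method of the Time-To-Inconsistency paper; the data, function name, and variable names are invented for the example.

```python
# Sketch: multi-turn robustness as a survival problem. The "event" is the
# first turn at which a model's answer becomes inconsistent under adversarial
# pressure; dialogues that end without an inconsistency are censored.
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(turn, survival_prob)] via the Kaplan-Meier estimator.

    durations -- turn of first inconsistency (or last turn if censored)
    observed  -- 1 if an inconsistency was observed, 0 if censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    at_risk = len(durations)
    surv, curve = 1.0, []
    for t in sorted(set(durations)):
        d = events.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk          # step down at each event time
        curve.append((t, surv))
        # subjects whose duration equals t leave the risk set after t
        at_risk -= sum(1 for u in durations if u == t)
    return curve

# Hypothetical data: 8 adversarial dialogues, durations measured in turns.
durations = [3, 5, 5, 7, 10, 10, 12, 12]
observed  = [1, 1, 0, 1, 1,  0,  1,  0]   # 0 = ended while still consistent
for turn, s in kaplan_meier(durations, observed):
    print(f"P(consistent beyond turn {turn}) ~ {s:.2f}")
```

The same framing supports comparing attack strategies or model variants by their survival curves rather than by a single attack-success rate.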

Sources

Mitigating Modal Imbalance in Multimodal Reasoning

Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

A Granular Study of Safety Pretraining under Model Abliteration

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Quantifying Risks in Multi-turn Conversation with Large Language Models

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

The Algebra of Meaning: Why Machines Need Montague More Than Moore's Law

SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
