The field of large language models is moving towards a deeper understanding of the safety risks that arise during their development and deployment. Recent research has highlighted how these models can be exploited for harmful purposes, such as generating unsafe queries or producing harmful outputs. Reinforcement learning with verifiable rewards (RLVR) has proven particularly vulnerable to such exploitation, making the development of safety-aware training pipelines increasingly urgent.

Noteworthy papers include HarmRLVR, which demonstrates that RLVR can be exploited for harmful alignment; SafeSearch, which presents a multi-objective reinforcement learning approach that jointly aligns safety and utility in LLM search agents; and Agentic Reinforcement Learning for Search is Unsafe, which highlights the safety risks of agentic RL by showing that simple attacks can trigger cascades of harmful searches and answers.
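
To make the idea of jointly optimizing safety and utility more concrete, the sketch below shows one common way a multi-objective RL reward can be scalarized into a single training signal. This is a minimal illustration under assumed names and weights, not SafeSearch's actual reward formulation.

```python
# Illustrative sketch only: a hypothetical scalarized multi-objective reward
# combining a utility score and a safety score for RL fine-tuning of a search
# agent. Function names, fields, and weights are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    utility: float = 1.0  # weight on task success (e.g., verifiable answer correctness)
    safety: float = 1.0   # weight on avoiding unsafe queries or harmful outputs


def combined_reward(utility_score: float, safety_score: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Collapse two objectives into one scalar reward.

    utility_score: e.g., 1.0 if the agent's final answer is verifiably correct, else 0.0.
    safety_score:  e.g., 1.0 if the trajectory issued no unsafe query or output, else 0.0.
    """
    return weights.utility * utility_score + weights.safety * safety_score


# A trajectory that answers correctly but issues an unsafe search query
# receives a lower reward than one that stays safe and correct.
unsafe_but_correct = combined_reward(utility_score=1.0, safety_score=0.0)
safe_and_correct = combined_reward(utility_score=1.0, safety_score=1.0)
assert safe_and_correct > unsafe_but_correct
```

The weighting here is the simplest possible choice; in practice the relative weights (or a constrained formulation) determine how much utility the agent is willing to trade away to stay safe.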