Advancements in Large Language Model Safety and Security

The field of large language models (LLMs) is rapidly evolving, with a growing focus on safety and security. Recent research has highlighted the importance of system-level safety, red teaming, and the development of effective guardrails to prevent jailbreak attacks. Innovative approaches, such as collaborative multi-agent frameworks and rhetorical-strategy-aware rational speech act frameworks, are being explored to improve irony detection and figurative language understanding. Furthermore, researchers are investigating the ethics of using LLMs for offensive security and developing frameworks to evaluate the security and alignment of LLMs. Noteworthy papers include:

  • CAF-I, which introduces a collaborative multi-agent framework for enhanced irony detection with large language models, achieving state-of-the-art zero-shot performance.
  • $(RSA)^2$, which presents a rhetorical-strategy-aware rational speech act framework for figurative language understanding, enabling human-compatible interpretations of non-literal utterances.
  • SoK: Evaluating Jailbreak Guardrails for Large Language Models, which provides a holistic analysis of jailbreak guardrails for LLMs and introduces a novel taxonomy and evaluation framework. These advancements demonstrate significant progress in addressing the challenges associated with LLMs and highlight the need for continued research in this area to ensure the safe and responsible development of these powerful technologies.

Sources

A Red Teaming Roadmap Towards System-Level Safety

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

On the Ethics of Using LLMs for Offensive Security

$(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding

From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges

Evaluation empirique de la s\'ecurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vuln\'erabilit\'es par exp\'erimentations de jailbreaks

SoK: Evaluating Jailbreak Guardrails for Large Language Models

Quantifying Azure RBAC Wildcard Overreach

Built with on top of