Advances in Language Model Security

The field of language model security is rapidly evolving, with a growing focus on protecting against various types of attacks, including model extraction attacks, prompt injection attacks, and jailbreak risks. Researchers are developing innovative defense mechanisms, such as integrated attack methodologies, adaptive defense mechanisms, and specialized metrics for evaluating attack effectiveness and defense performance. Noteworthy papers in this area include:

  • A Survey on Model Extraction Attacks and Defenses for Large Language Models, which provides a comprehensive taxonomy of attacks and defenses and proposes promising research directions.
  • STACK: Adversarial Attacks on LLM Safeguard Pipelines, which develops and evaluates a staged attack procedure against defense pipelines.
  • Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks, which introduces an open-source and open-weight LLM with built-in model-level defense.

Sources

A Survey on Model Extraction Attacks and Defenses for Large Language Models

STACK: Adversarial Attacks on LLM Safeguard Pipelines

SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents

Rethinking Broken Object Level Authorization Attacks Under Zero Trust Principle

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Built with on top of