Advances in Large Language Model Robustness and Security

Research on large language models is increasingly focused on robustness and security, with work targeting issues such as in-context reward hacking, memorization, and adversarial attacks. Noteworthy papers include Specification Self-Correction, which introduces a framework for identifying and correcting flaws in guiding specifications at test time; Strategic Deflection, which defends against logit manipulation attacks by producing semantically adjacent responses that neutralize harmful intent; SDD, which counters malicious fine-tuning by encouraging models to produce high-quality but irrelevant responses to harmful prompts; and Adversarial Defence without Adversarial Defence, which improves robustness via instance-level principal component removal.
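The instance-level principal component removal mentioned above can be illustrated with a minimal sketch: for each input, the leading principal component of its token embedding matrix is projected out before the embeddings are used downstream. The function name, the use of NumPy, and the per-instance SVD shown here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def remove_top_principal_components(token_embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Project out the top-k principal components of one instance's
    token embedding matrix (shape: [num_tokens, hidden_dim]).

    Generic illustration of instance-level principal component removal;
    the exact procedure in the cited paper may differ.
    """
    # Center the embeddings for this instance only.
    mean = token_embeddings.mean(axis=0, keepdims=True)
    centered = token_embeddings - mean

    # SVD of the centered matrix; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_directions = vt[:k]  # shape: [k, hidden_dim]

    # Subtract the projection onto the top-k directions, then restore the mean.
    projection = centered @ top_directions.T @ top_directions
    return centered - projection + mean


# Example: 12 tokens with 768-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 768))
cleaned = remove_top_principal_components(embeddings, k=1)
print(cleaned.shape)  # (12, 768)
```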

Sources

Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Memorization in Fine-Tuned Large Language Models

Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing

SDD: Self-Degraded Defense against Malicious Fine-tuning

Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Modelling Adjectival Modification Effects on Semantic Plausibility

Strategic Deflection: Defending LLMs from Logit Manipulation
