Advances in Large Language Model Robustness and Security

Research on large language models is increasingly focused on robustness and security, with work targeting issues such as in-context reward hacking, memorization, and adversarial attacks. Noteworthy papers include Specification Self-Correction, which introduces a framework for identifying and correcting flaws in guiding specifications at test time; Strategic Deflection, which defends against logit manipulation attacks by producing semantically adjacent responses that neutralize harmful intent; SDD, which counters malicious fine-tuning by encouraging models to produce high-quality but irrelevant responses to harmful prompts; and Adversarial Defence without Adversarial Defence, which improves robustness via instance-level principal component removal.
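The instance-level principal component removal mentioned above can be illustrated with a minimal sketch: for each input, the leading principal component of its token embedding matrix is projected out before the embeddings are used downstream. The function name, the use of NumPy, and the per-instance SVD shown here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def remove_top_principal_components(token_embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Project out the top-k principal components of one instance's
    token embedding matrix (shape: [num_tokens, hidden_dim]).

    Generic illustration of instance-level principal component removal;
    the exact procedure in the cited paper may differ.
    """
    # Center the embeddings for this instance only.
    mean = token_embeddings.mean(axis=0, keepdims=True)
    centered = token_embeddings - mean

    # SVD of the centered matrix; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_directions = vt[:k]  # shape: [k, hidden_dim]

    # Subtract the projection onto the top-k directions, then restore the mean.
    projection = centered @ top_directions.T @ top_directions
    return centered - projection + mean


# Example: 12 tokens with 768-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 768))
cleaned = remove_top_principal_components(embeddings, k=1)
print(cleaned.shape)  # (12, 768)
```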

Sources

Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Memorization in Fine-Tuned Large Language Models

Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing

SDD: Self-Degraded Defense against Malicious Fine-tuning

Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Modelling Adjectival Modification Effects on Semantic Plausibility

Strategic Deflection: Defending LLMs from Logit Manipulation
