Research on large language models is moving toward finer-grained control over model behavior, with a focus on safety and security. Recent work has explored methods for auditing model dispositions, dissecting refusal behavior, and performing targeted interventions. These advances could enable more controlled deployment of large language models in security-sensitive domains. Noteworthy papers in this area include:
- Stated Preference for Interaction and Continued Engagement (SPICE), which introduces a diagnostic signal for evaluating a model's willingness to re-engage with a user.
- Beyond "I'm Sorry, I Can't", which dissects refusal in large language models using sparse autoencoders and shows that the identified features causally influence refusal behavior.
- MEUV, which achieves fine-grained capability activation in large language models via mutually exclusive unlock vectors.
- RepIt, which enables precise interventions by isolating concept-specific representations (a minimal sketch of this style of activation-level intervention follows the list).
- Enterprise AI Must Enforce Participant-Aware Access Control, which argues that enterprise assistants must check the permissions of every conversation participant, not just the requester, before surfacing retrieved content, in order to prevent data leakage (see the second sketch after this list).
- ReCoVeR, which proposes a novel approach for reducing language confusion in multilingual large language models.
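Several of these papers (the sparse-autoencoder refusal analysis, MEUV's unlock vectors, RepIt's concept-specific representations) share the idea of representing a behavior as a direction in activation space and then adding or ablating that direction at inference time. The sketch below illustrates only that general idea; the model, hook, and difference-of-means recipe are hypothetical stand-ins, not the papers' actual methods.

```python
# Illustrative activation-steering sketch: treat "refusal" as a unit direction in
# the residual stream, then either push along it or project it out via a hook.
# ToyBlock, refusal_dir, and alpha are hypothetical, not taken from the papers.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

class ToyBlock(nn.Module):
    """Stand-in for one transformer block acting on the residual stream."""
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, d)

    def forward(self, x):
        return x + self.linear(x)

block = ToyBlock(d_model)

# Hypothetical "refusal" direction, e.g., a difference of mean activations
# between refused and complied prompts, normalized to unit length.
refusal_dir = torch.randn(d_model)
refusal_dir = refusal_dir / refusal_dir.norm()

def make_steering_hook(direction, alpha):
    """Forward hook: alpha steers along `direction`; alpha=None ablates it."""
    def hook(module, inputs, output):
        if alpha is None:
            coeff = (output @ direction).unsqueeze(-1)   # per-token projection
            return output - coeff * direction            # remove the direction
        return output + alpha * direction                # add the direction
    return hook

x = torch.randn(2, 5, d_model)                           # (batch, seq, d_model)

handle = block.register_forward_hook(make_steering_hook(refusal_dir, alpha=None))
ablated = block(x)
handle.remove()

baseline = block(x)
# After ablation, the output has (near-)zero component along refusal_dir.
print("baseline proj:", (baseline @ refusal_dir).abs().mean().item())
print("ablated  proj:", (ablated @ refusal_dir).abs().mean().item())
```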
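The participant-aware access-control paper is about deployment policy rather than a specific algorithm, but the core check it calls for can be stated simply: a retrieved document may be used only if every participant in the conversation is authorized to read it. The sketch below is a minimal illustration of that rule under assumed data structures (`Document`, `allowed_readers`), not an API from the paper.

```python
# Minimal participant-aware access-control check: a document is usable only if
# ALL conversation participants may read it, not just the user who asked.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    allowed_readers: set[str] = field(default_factory=set)

def filter_for_participants(docs: list[Document], participants: set[str]) -> list[Document]:
    """Keep only documents that every conversation participant may read."""
    return [d for d in docs if participants <= d.allowed_readers]

docs = [
    Document("salary-plan", "Q3 compensation bands...", {"alice"}),
    Document("handbook", "Company handbook...", {"alice", "bob", "carol"}),
]

# Alice asks a question in a thread that Bob can also see: the salary document
# is excluded even though Alice alone would be allowed to read it.
print([d.doc_id for d in filter_for_participants(docs, {"alice", "bob"})])
# -> ['handbook']
```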