Advances in Controlling Language Models

The field of natural language processing is seeing significant progress in controlling language models, with a focus on improving their steerability, safety, and reliability. Researchers are developing methods to evaluate and enhance how well large language models produce outputs aligned with user goals while minimizing undesirable side effects. One key direction is frameworks that systematically assess and improve steerability, including multi-dimensional goal spaces and geometric approaches to safety. Another is controlling multiple behavioral attributes, such as tone, sentiment, and toxicity, without degrading text quality. Noteworthy papers in this area include:

  • A Course Correction in Steerability Evaluation, which introduces a framework for evaluating the steerability of language models and reveals miscalibration and unintended side effects in current models.
  • Learning Safety Constraints for Large Language Models, which proposes a geometric approach to safety that learns and enforces multiple safety constraints directly in the model's representation space (a generic sketch of these ideas appears after this list).
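
The cited papers do not include code here; the following is a minimal, self-contained sketch, assuming NumPy and toy dimensions, of the two generic ideas mentioned above: shifting a hidden representation along several attribute directions at once, and enforcing a learned half-space safety constraint geometrically in that same representation space. The names `attribute_directions`, `strengths`, and `safety_normal` are illustrative assumptions, not identifiers from the papers.

```python
# Illustrative sketch only: multi-attribute activation steering plus a
# geometric (half-space) safety constraint on a hidden-state vector.
import numpy as np


def steer(hidden: np.ndarray,
          attribute_directions: dict[str, np.ndarray],
          strengths: dict[str, float]) -> np.ndarray:
    """Shift a hidden-state vector along multiple attribute directions.

    Each direction is assumed to be a unit vector (e.g. a 'formality' or
    'positivity' axis estimated elsewhere); the strength sets how far to move.
    """
    steered = hidden.copy()
    for name, direction in attribute_directions.items():
        steered += strengths.get(name, 0.0) * direction
    return steered


def enforce_halfspace(hidden: np.ndarray,
                      safety_normal: np.ndarray,
                      threshold: float = 0.0) -> np.ndarray:
    """Project a hidden state back onto the safe side of a learned half-space.

    The constraint is <safety_normal, hidden> <= threshold; if violated, the
    vector is projected onto the boundary hyperplane.
    """
    violation = hidden @ safety_normal - threshold
    if violation > 0.0:
        hidden = hidden - violation * safety_normal / (safety_normal @ safety_normal)
    return hidden


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 16  # toy hidden size
    h = rng.normal(size=d)

    # Toy attribute directions and safety normal (random stand-ins here).
    dirs = {"formality": rng.normal(size=d), "positivity": rng.normal(size=d)}
    dirs = {k: v / np.linalg.norm(v) for k, v in dirs.items()}
    n = rng.normal(size=d)
    n = n / np.linalg.norm(n)

    h_steered = steer(h, dirs, {"formality": 2.0, "positivity": -1.0})
    h_safe = enforce_halfspace(h_steered, n, threshold=0.5)
    print("constraint value before:", h_steered @ n - 0.5)
    print("constraint value after: ", h_safe @ n - 0.5)
```

In practice, attribute directions are typically estimated from contrastive activation pairs and the steering and projection steps are applied per layer during generation; this toy example only illustrates the underlying vector geometry.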

Sources

A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models

Learning Safety Constraints for Large Language Models

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization

Spegion: Implicit and Non-Lexical Regions with Sized Allocations

HyperSteer: Activation Steering at Scale with Hypernetworks

SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
