Advances in Large Language Model Safety and Interpretability

The field of large language models (LLMs) is evolving rapidly, with a growing focus on safety and interpretability. Recent research highlights the importance of preventing emergent misalignment, which can arise when these models are fine-tuned for specific tasks. Several studies propose solutions to this problem, including regularization techniques, safe subspace projection, and gradient surgery. There is also increasing interest in making LLMs more interpretable, with techniques such as sparse autoencoders and feature visualization being explored. Noteworthy papers in this area include In-Training Defenses against Emergent Misalignment in Language Models, which presents a systematic study of in-training safeguards against emergent misalignment, and Structural Equation-VAE, which introduces a novel architecture for disentangled latent representations in tabular data. Other notable papers include MASteer, which proposes a multi-agent framework for trustworthiness repair in LLMs, and Gradient Surgery for Safe LLM Fine-Tuning, which introduces a method for resolving conflicting gradients during LLM fine-tuning.
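
To make the gradient-surgery idea concrete, the sketch below shows a PCGrad-style projection in PyTorch: when a task gradient points against a safety gradient (negative dot product), its conflicting component is removed before the parameter update. This is a generic illustration under the assumption of flattened gradient vectors, not the exact procedure of Gradient Surgery for Safe LLM Fine-Tuning; the function name and toy tensors are hypothetical.

```python
import torch

def project_conflicting(task_grad: torch.Tensor, safety_grad: torch.Tensor) -> torch.Tensor:
    """If the task gradient conflicts with the safety gradient (negative
    dot product), remove its component along the safety direction so the
    update no longer pushes against the safety objective (PCGrad-style)."""
    dot = torch.dot(task_grad, safety_grad)
    if dot < 0:
        task_grad = task_grad - (dot / safety_grad.norm().pow(2).clamp_min(1e-12)) * safety_grad
    return task_grad

# Toy example with flattened gradient vectors.
task_grad = torch.tensor([1.0, -2.0, 0.5])
safety_grad = torch.tensor([0.0, 1.0, 0.0])
print(project_conflicting(task_grad, safety_grad))  # -> tensor([1.0000, 0.0000, 0.5000])
```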

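Similarly, the sparse-autoencoder line of interpretability work trains an overcomplete encoder on model activations with an L1 sparsity penalty, so that individual latent units tend to fire for narrower, more interpretable features. The sketch below is a minimal, generic implementation; the dimensions and the l1_coeff value are illustrative assumptions rather than settings from the cited papers.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations: an overcomplete
    ReLU encoder plus a linear decoder that reconstructs the input."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff=1e-3):
    # Reconstruction term keeps the features faithful to the activations;
    # the L1 term drives most code entries to zero (sparsity).
    return nn.functional.mse_loss(recon, x) + l1_coeff * codes.abs().mean()

# Toy usage on random "activations" standing in for an LLM residual stream.
sae = SparseAutoencoder(d_model=64, d_hidden=256)
x = torch.randn(8, 64)
recon, codes = sae(x)
loss = sae_loss(x, recon, codes)
loss.backward()
```
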
Sources

In-Training Defenses against Emergent Misalignment in Language Models

Structural Equation-VAE: Disentangled Latent Representations for Tabular Data

MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair

Gradient Surgery for Safe LLM Fine-Tuning

Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Representation Understanding via Activation Maximization

Classifier Language Models: Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks

Interpretable Reward Model via Sparse Autoencoder

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
