Advances in Large Language Model Safety and Interpretability

The field of large language models (LLMs) is rapidly evolving, with a growing focus on safety and interpretability. Recent research has highlighted the importance of understanding how LLMs learn and represent knowledge, as well as the need to develop effective methods for detecting and preventing harmful behavior. One key area of research is the development of techniques for analyzing and interpreting LLM internals, such as concept-driven neuron attribution and activation transport operators. These methods have the potential to provide valuable insights into how LLMs work and how they can be improved. Another important area of research is the development of safety guardrails and defense mechanisms, such as speculative safety-aware decoding and prompt injection detection. These mechanisms can help to prevent LLMs from being used for malicious purposes and ensure that they are used in a responsible and beneficial way. Notable papers in this area include 'A Review of Developmental Interpretability in Large Language Models', which provides a comprehensive overview of the field of developmental interpretability, and 'NEAT: Concept driven Neuron Attribution in LLMs', which proposes a new method for locating significant neurons in LLMs. Overall, the field of LLM safety and interpretability is rapidly advancing, with new techniques and methods being developed to address the challenges and risks associated with these powerful models.

Sources

A Review of Developmental Interpretability in Large Language Models

NEAT: Concept driven Neuron Attribution in LLMs

PickleBall: Secure Deserialization of Pickle-based Machine Learning Models

Time Series Based Network Intrusion Detection using MTF-Aided Transformer

LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs

HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Activation Transport Operators

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

MalLoc: Toward Fine-grained Android Malicious Payload Localization via LLMs

Speculative Safety-Aware Decoding

Learning from Few Samples: A Novel Approach for High-Quality Malcode Generation

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Tricking LLM-Based NPCs into Spilling Secrets

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Language Models Identify Ambiguities and Exploit Loopholes

The Art of Hide and Seek: Making Pickle-Based Model Supply Chain Poisoning Stealthy Again

FlowletFormer: Network Behavioral Semantic Aware Pre-training Model for Traffic Classification

Evaluating Language Model Reasoning about Confidential Information

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Smart Contract Intent Detection with Pre-trained Programming Language Model

IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

FlowMalTrans: Unsupervised Binary Code Translation for Malware Detection Using Flow-Adapter Architecture

Ransomware 3.0: Self-Composing and LLM-Orchestrated

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance