Advances in Language Model Interpretability and Robustness

The field of natural language processing is moving toward a deeper understanding of language models, with a particular focus on interpretability and robustness. Recent studies examine the vulnerability of large language models to misinformation introduced during continued training and the importance of monitoring their factual integrity. Causal masking has also been applied to spatial data, with promising results. Further research investigates the periodicity of information in natural language, the monitorability of chain-of-thought outputs, and the legibility of reasoning models' chains of thought. Noteworthy papers include 'Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning', which introduces a framework for probing belief dynamics in continually trained language models, and 'Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models', which demonstrates the viability of applying causal masking to spatial data.
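
For readers unfamiliar with the probing methodology that several of these papers build on, the sketch below illustrates the basic idea of fitting a linear probe to hidden-state vectors and reading off how strongly a representation encodes a property such as belief in a statement. The data, dimensions, and the synthetic "belief direction" are assumptions made purely for illustration and do not reproduce the setup of any cited paper.

```python
# Minimal, hypothetical sketch of a linear "belief" probe on hidden states.
# The hidden-state vectors and labels here are synthetic stand-ins; in the
# cited work they would instead come from a language model's intermediate
# layers, e.g. before and after continual pre-training on poisoned data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64          # assumed hidden-state dimensionality
n_statements = 200    # assumed number of probed factual statements

# Synthetic hidden states: "believed" statements are shifted along a belief direction.
belief_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_statements)   # 1 = model treats the statement as true
hidden_states = rng.normal(size=(n_statements, d_model)) + np.outer(labels, belief_direction)

# Fit a linear probe on a training split; held-out accuracy is one way to
# quantify how linearly decodable the belief is from a given layer, and how
# that changes across training checkpoints.
probe = LogisticRegression(max_iter=1000).fit(hidden_states[:150], labels[:150])
print("held-out probe accuracy:", probe.score(hidden_states[150:], labels[150:]))
```

In practice one would fit such a probe per layer and per checkpoint and track how the decodable belief shifts over continual training, but the exact protocols differ across the papers listed below.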

Sources

Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models

Identifying the Periodicity of Information in Natural Language

Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Reasoning Models Sometimes Output Illegible Chains of Thought

Thought Branches: Interpreting LLM Reasoning Requires Resampling

ParaScopes: What do Language Models Activations Encode About Future Text?

Reversal Invariance in Autoregressive Language Models

Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering

Priors in Time: Missing Inductive Biases for Language Model Interpretability

Accumulating Context Changes the Beliefs of Language Models

LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS

Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation

Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics

Addressing divergent representations from causal interventions on neural networks
