Advances in AI Ethics and Value Alignment

The field of AI ethics and value alignment is evolving rapidly, with growing attention to subliminal learning, moral reasoning, and value conflicts. Recent research treats ethics as a structural lens for alignment rather than an external add-on, a shift that has produced new frameworks and methods for probing moral features and evaluating value prioritization in language models. Mechanistic analysis and controlled experiments have clarified how subliminal learning and value expression operate in large language models, while new benchmarks and evaluation pipelines make it possible to assess models' contextual sensitivity and ethical reasoning. Overall, the field is moving toward a more nuanced understanding of the interplay between AI systems and human values, and toward more effective, generalizable methods for aligning machine behavior with human morals and values.

Some particularly noteworthy papers:

Towards Understanding Subliminal Learning provides a mechanistic analysis of subliminal learning in language models.

Open Opportunities in AI Safety, Alignment, and Ethics introduces a moral problem space for representing moral distinctions in AI systems.

MoVa contributes a suite of resources for generalizable classification of human morals and values.

Dual Mechanisms of Value Expression analyzes intrinsic versus prompted value mechanisms in large language models.

Generative Value Conflicts Reveal LLM Priorities introduces a pipeline for evaluating how language models prioritize competing values.

RoleConflictBench provides a benchmark for evaluating language models' contextual sensitivity in complex social dilemmas.

Advancing Automated Ethical Profiling in SE presents a fully automated framework for assessing ethical reasoning capabilities in language models.

An Anthropologist LLM to Elicit Users' Moral Preferences through Role-Play investigates eliciting users' moral decision-making through immersive role-playing games and language model analysis.

Who is In Charge? dissects role conflicts in instruction following and provides mechanistic interpretations on a large-scale dataset.

Sources

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Open Opportunities in AI Safety, Alignment, and Ethics (AI SAE)

MoVa: Towards Generalizable Classification of Human Morals and Values

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

Generative Value Conflicts Reveal LLM Priorities

RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity

Advancing Automated Ethical Profiling in SE: a Zero-Shot Evaluation of LLM Reasoning

An Anthropologist LLM to Elicit Users' Moral Preferences through Role-Play

Who is In Charge? Dissecting Role Conflicts in Instruction Following
