Advances in Large Language Model Alignment and Evaluation

The field of natural language processing is witnessing significant advances in the alignment and evaluation of large language models (LLMs). Recent developments indicate a growing focus on improving the controllability and reliability of LLMs, with particular emphasis on their ability to follow complex, fine-grained instructions. Researchers are exploring novel evaluation frameworks and benchmarks to assess LLM performance on a range of tasks, including lexical instruction following, safety signal detection, and semantic similarity measurement. Furthermore, there is an increasing interest in developing multimodal judges that can follow diverse evaluation criteria and produce reliable judgments. Noteworthy papers in this area include:

  • LexInstructEval, which introduces a new benchmark and evaluation framework for fine-grained lexical instruction following.
  • Multi-Value Alignment, which proposes a novel framework for aligning LLMs with multiple human values.
  • OpenGloss, which presents a synthetic encyclopedic dictionary and semantic knowledge graph for English.
  • The Text Aphasia Battery, which introduces a clinically-grounded benchmark for assessing aphasic-like deficits in LLMs.
  • Multi-Crit, which develops a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria.
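Evaluating fine-grained lexical instruction following is often amenable to rule-based verification, since constraints such as required words, forbidden words, or length limits can be checked programmatically. The sketch below illustrates this idea in miniature; the constraint names and function are hypothetical and do not reflect LexInstructEval's actual schema.

```python
import re

def check_lexical_constraints(response: str, constraints: dict) -> dict:
    """Verify a model response against simple lexical instructions.

    Constraint keys ("must_include", "must_exclude", "max_words") are
    illustrative placeholders, not any benchmark's real specification.
    """
    results = {}
    if "must_include" in constraints:
        # Every required word must appear as a whole word (case-insensitive).
        results["must_include"] = all(
            re.search(rf"\b{re.escape(w)}\b", response, re.IGNORECASE)
            for w in constraints["must_include"]
        )
    if "must_exclude" in constraints:
        # No forbidden word may appear anywhere in the response.
        results["must_exclude"] = not any(
            re.search(rf"\b{re.escape(w)}\b", response, re.IGNORECASE)
            for w in constraints["must_exclude"]
        )
    if "max_words" in constraints:
        # Whitespace-delimited word count must not exceed the limit.
        results["max_words"] = len(response.split()) <= constraints["max_words"]
    return results

checks = check_lexical_constraints(
    "The quick brown fox jumps over the lazy dog.",
    {"must_include": ["fox", "dog"], "must_exclude": ["cat"], "max_words": 12},
)
print(checks)
```

A full benchmark would aggregate such per-constraint verdicts into instruction-level and model-level scores, but the core check remains this kind of deterministic matching.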

Sources

LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models

Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation

OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Knowledge-based Graphical Method for Safety Signal Detection in Clinical Trials

Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Scaling Item-to-Standard Alignment with Large Language Models: Accuracy, Limits, and Solutions

Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity

The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

On Evaluating LLM Alignment by Evaluating LLMs as Judges

Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
