The field of natural language processing is witnessing significant advances in the alignment and evaluation of large language models (LLMs). Recent work shows a growing focus on improving the controllability and reliability of LLMs, with particular emphasis on their ability to follow complex, fine-grained instructions. Researchers are exploring novel evaluation frameworks and benchmarks to assess LLM performance across tasks such as lexical instruction following, safety signal detection, and semantic similarity measurement. There is also increasing interest in developing multimodal judges that can follow diverse evaluation criteria and produce reliable judgments. Noteworthy papers in this area include:
- LexInstructEval, which introduces a new benchmark and evaluation framework for fine-grained lexical instruction following.
- Multi-Value Alignment, which proposes a novel framework for aligning LLMs with multiple human values.
- OpenGloss, which presents a synthetic encyclopedic dictionary and semantic knowledge graph for English.
- The Text Aphasia Battery, which introduces a clinically-grounded benchmark for assessing aphasic-like deficits in LLMs.
- Multi-Crit, which introduces a benchmark for evaluating multimodal judges on their capacity to follow pluralistic evaluation criteria.