The field of natural language processing is moving towards more reliable and trustworthy evaluation methods for language models. Researchers are shifting their focus from traditional multiple-choice benchmarks to alternative approaches such as answer matching and human-centric operationalizations of automated essay scoring. This change is driven by the need to assess the true capabilities of language models and to identify areas where they struggle, such as bias and robustness. Noteworthy papers include LitBench, which introduces a benchmark and dataset for reliable evaluation of creative writing, and Answer Matching Outperforms Multiple Choice for Language Model Evaluation, which shows that answer matching achieves near-perfect agreement with human grading. These developments are expected to contribute significantly to the advancement of language models and their applications in various fields.
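
To make the answer-matching idea concrete, below is a minimal sketch of the protocol: the evaluated model produces a free-form answer, and a separate matcher model judges whether it conveys the same answer as the reference, instead of scoring a multiple-choice selection. This is an illustrative assumption of how such a pipeline might look, not the exact setup from the cited paper; the OpenAI-compatible client, model name, and prompt wording are placeholders.

```python
# Minimal answer-matching sketch (illustrative; assumes an OpenAI-compatible API).
from openai import OpenAI

client = OpenAI()

MATCH_PROMPT = """You are grading a model's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly "MATCH" if the candidate conveys the same answer as the
reference, otherwise reply with exactly "NO MATCH"."""


def answer_matches(question: str, reference: str, candidate: str,
                   matcher_model: str = "gpt-4o-mini") -> bool:
    """Ask a matcher model whether a free-form answer agrees with the reference."""
    response = client.chat.completions.create(
        model=matcher_model,  # placeholder matcher model name
        messages=[{"role": "user", "content": MATCH_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("MATCH")


if __name__ == "__main__":
    # Toy evaluation set; in practice these would be benchmark questions
    # paired with the evaluated model's free-form generations.
    items = [
        {"question": "What is the capital of Australia?",
         "reference": "Canberra",
         "candidate": "The capital city is Canberra."},
    ]
    accuracy = sum(
        answer_matches(x["question"], x["reference"], x["candidate"])
        for x in items
    ) / len(items)
    print(f"answer-matching accuracy: {accuracy:.2f}")
```

The key design choice, under these assumptions, is that the matcher only compares the candidate against the reference answer rather than choosing among fixed options, which is what allows the protocol to grade open-ended generations and, per the cited work, to track human grading closely.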