The field of natural language processing is moving toward more robust and reliable evaluation methods for large language models (LLMs) in high-stakes applications such as financial disclosures and legal decision-making. Researchers are focusing on building benchmarks and taxonomies to assess LLM performance in these domains. Noteworthy papers in this area include:

- GRAB, a finance-specific benchmark for unsupervised topic discovery in financial disclosures, which enables reproducible, standardized comparison across topic models.
- The Silent Judge, a study that exposes shortcut bias in LLMs used as judges of system outputs, highlighting the need for more faithful and transparent evaluation methods.
- Who's Your Judge, which proposes the task of judgment detection, i.e. distinguishing LLM-generated judgments from human-written ones, and introduces a lightweight neural detector that links LLM judges' biases to properties of the candidate responses (see the sketch after this list).
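To make the judgment-detection idea concrete, below is a minimal sketch of a lightweight detector that classifies a judgment text as LLM-generated or human-written. This is not the detector from Who's Your Judge; the toy examples, TF-IDF features, and tiny MLP are illustrative assumptions only.

```python
# Minimal sketch of a judgment-detection classifier (assumed setup, not the
# paper's method): TF-IDF features plus a small MLP that predicts whether a
# judgment was written by an LLM judge (1) or a human annotator (0).
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy judgment texts and labels -- purely illustrative examples.
texts = [
    "The response is comprehensive, well-structured, and addresses all aspects of the prompt.",
    "Overall, Answer A demonstrates superior coherence and factual accuracy compared to Answer B.",
    "honestly answer B just reads better, A rambles near the end",
    "B is fine but misses the edge case the question asked about.",
]
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])  # 1 = LLM-generated, 0 = human-written

# Lightweight features: TF-IDF over word unigrams and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = torch.tensor(vectorizer.fit_transform(texts).toarray(), dtype=torch.float32)

# Small MLP detector: features -> hidden layer -> logit for "LLM-generated".
detector = nn.Sequential(
    nn.Linear(X.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

# Train on the toy data for a few hundred steps.
for _ in range(200):
    optimizer.zero_grad()
    logits = detector(X).squeeze(1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

# Predicted probability that each judgment was produced by an LLM judge.
with torch.no_grad():
    probs = torch.sigmoid(detector(X).squeeze(1))
print(probs.tolist())
```

In practice such a detector could be paired with metadata about the candidate responses (length, style, position) to probe which properties correlate with an LLM judge's verdicts, in the spirit of linking judge biases to candidate properties.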