Evaluating and Improving Large Language Models in Legal and Financial Applications

The field of natural language processing is moving toward more robust and reliable evaluation methods for large language models (LLMs) in high-stakes applications such as financial disclosures and legal decision-making. Researchers are developing benchmarks and taxonomies to assess LLM performance in these domains. Noteworthy papers in this area include:

GRAB, a risk-taxonomy-grounded benchmark for unsupervised topic discovery in financial disclosures, which enables reproducible, standardized comparison across topic models.

The Silent Judge, a study exposing the unacknowledged shortcut bias of LLMs when they are used as judges to evaluate system outputs, which highlights the need for more faithful and transparent evaluation methods.

Who's Your Judge?, which proposes the task of judgment detection, distinguishing LLM-generated judgments from human-written ones, and introduces a lightweight neural detector that links LLM judges' biases to the candidates' properties; a minimal sketch of this task framing follows the list.

Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction, which compares unsupervised metrics for assessing the quality of extracted judicial decisions.
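
To make the judgment-detection framing concrete, the sketch below treats it as binary text classification over judgment texts. This is not the detector from Who's Your Judge?: a TF-IDF plus logistic-regression pipeline stands in for the paper's lightweight neural detector, and the example judgments are invented placeholders.

```python
# Minimal sketch of judgment detection as binary text classification.
# Assumption: a TF-IDF + logistic-regression pipeline stands in for the
# paper's lightweight neural detector; the judgments below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus of judgment texts. Label 1 = LLM-generated, 0 = human-written.
judgments = [
    ("The response is comprehensive, well-structured, and addresses all key aspects.", 1),
    ("Overall, the answer demonstrates a clear and thorough understanding of the question.", 1),
    ("Both responses are helpful, but Response A offers more actionable detail.", 1),
    ("B misses the deadline clause entirely; A is correct. Prefer A.", 0),
    ("Too verbose and buries the holding. B is better.", 0),
    ("A cites the wrong statute, so B wins.", 0),
]
texts, labels = zip(*judgments)

# Character n-grams capture stylistic cues (hedging, boilerplate phrasing)
# that tend to differ between model-written and human-written judgments.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

# Probability that an unseen judgment was LLM-generated.
new_judgment = "The answer is detailed, accurate, and well-organized overall."
print(detector.predict_proba([new_judgment])[0][1])
```

A detector that actually links judge biases to candidate properties would add candidate-side features (for example, the length or position of the judged response) alongside the judgment text; the sketch above only illustrates the detection side of the task.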

Sources

GRAB: A Risk Taxonomy-Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures

Who's Your Judge? On the Detectability of LLM-Generated Judgments

The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction
