Advancements in Large Language Models and Information Retrieval

The field of natural language processing is moving toward more efficient and effective methods for evaluating and improving large language models (LLMs). Recent research has focused on new metrics and frameworks for assessing LLM performance, particularly on question answering and mathematical reasoning. One notable direction is natural language inference (NLI) scoring, which has proven to be a lightweight, effective alternative to more computationally expensive evaluation methods. There is also growing interest in probing the limitations and vulnerabilities of LLMs, including their sensitivity to input perturbations and their ability to capture human notions of interestingness. Progress in these areas is further supported by new datasets and benchmarks, such as those for Russian information retrieval and mathematical problem generation.

Noteworthy papers include Revisiting NLI, which demonstrates the effectiveness of NLI-based evaluation for LLM question answering; Numerical Sensitivity and Robustness, which exposes the fragility of LLM mathematical reasoning under numerical perturbations; OKBench, which provides a fully automated framework for generating dynamic knowledge benchmarks; and Routesplain, which introduces a faithful and intervenable routing approach for software-related tasks.
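
To make the NLI-scoring idea concrete, here is a minimal sketch of how a gold reference and a model answer can be scored with an off-the-shelf MNLI-finetuned classifier. This illustrates the general technique, not the exact metric of Revisiting NLI; the bidirectional min-of-entailments aggregation and the choice of the roberta-large-mnli checkpoint are assumptions for illustration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any MNLI-finetuned checkpoint works similarly; this one is an assumption.
MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

# Score a candidate answer against a gold reference in both directions,
# so paraphrases that add or drop detail are still credited (an assumed
# aggregation choice, not necessarily the paper's).
gold = "Leo Tolstoy wrote War and Peace."
candidate = "The novel was written by Leo Tolstoy."
score = min(entailment_score(gold, candidate), entailment_score(candidate, gold))
print(f"NLI answer score: {score:.3f}")
```

A single forward pass of a small NLI model per answer pair is what makes this cheaper than LLM-as-judge evaluation.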

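The perturbation-sensitivity work can likewise be illustrated with a small script: rewrite the numbers in a math word problem, recompute the expected answer, and check whether the model's output tracks the change. This is a simplified sketch of the testing idea behind Numerical Sensitivity and Robustness and MSCR, not either paper's actual protocol; `query_model` is a hypothetical stand-in for an LLM call, implemented here as a toy solver so the script runs end to end.

```python
import random
import re

def perturb_numbers(problem: str, rng: random.Random):
    """Replace each integer in the problem with a nearby value,
    returning the perturbed text and the (old, new) pairs."""
    pairs = []
    def repl(match):
        old = int(match.group())
        new = old + rng.randint(1, 9)
        pairs.append((old, new))
        return str(new)
    return re.sub(r"\d+", repl, problem), pairs

def query_model(problem: str) -> int:
    # Hypothetical stand-in for an LLM API call; this toy "solver"
    # handles the additive template below.
    a, b = map(int, re.findall(r"\d+", problem))
    return a + b

rng = random.Random(0)
base = "Alice has 12 apples and buys 7 more. How many does she have?"
base_answer = query_model(base)

for _ in range(3):
    variant, pairs = perturb_numbers(base, rng)
    # For this additive template, the true answer shifts by the sum of deltas.
    expected = base_answer + sum(new - old for old, new in pairs)
    got = query_model(variant)
    print(f"{variant!r}: expected {expected}, got {got}, "
          f"{'consistent' if got == expected else 'SENSITIVE'}")
```

A real harness would swap in an actual model behind `query_model` and report the fraction of perturbed variants answered correctly.
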
Sources

Association via Entropy Reduction

Wikipedia-based Datasets in Russian Information Retrieval Benchmark RusBEIR

Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering

Computational Blueprints: Generating Isomorphic Mathematics Problems with Large Language Models

Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models

MSCR: Exploring the Vulnerability of LLMs' Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement

DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation

A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models

OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

Routesplain: Towards Faithful and Intervenable Routing for Software-related Tasks
