Evaluating and Advancing Large Language Models

The field of large language models (LLMs) is evolving rapidly, with parallel efforts to improve evaluation methods and to advance model capabilities. Recent work highlights the need for robust, scalable evaluation metrics that align closely with human judgment, since existing automatic metrics often align poorly with human preferences. Researchers are exploring vision large language models (vLLMs) and related approaches to close this gap, proposing new evaluation frameworks and benchmarks for assessing the quality of generated text, images, and 3D objects. Other studies examine how effectively LLMs perform tasks such as document reranking, relevance judgment, and text-to-image generation evaluation.

Noteworthy papers include:

- Gen3DEval, which introduces an evaluation framework for text-to-3D generation built on vLLMs.
- LMM4LMM, which presents a comprehensive dataset and benchmark for evaluating large-multimodal image generation.
- RankAlign, which proposes a ranking-based training method to close the generator-validator gap in LLMs.
- Validating LLM-Generated Relevance Labels for Educational Resource Search, which investigates how well LLMs evaluate domain-specific search results.
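To make the LLM-as-judge pattern behind several of these relevance-judgment studies concrete, the sketch below scores query-document pairs with a pluggable judging function. The prompt template, the 0-3 graded scale, and the `judge` callable are illustrative assumptions, not the protocol of any specific paper listed here.

```python
from typing import Callable

# Hypothetical prompt template for graded relevance judgment (0-3 scale);
# not taken from any of the cited papers.
PROMPT = (
    "Rate how relevant the document is to the query on a 0-3 scale, "
    "where 0 = not relevant and 3 = highly relevant. Reply with one digit.\n"
    "Query: {query}\nDocument: {document}\nRating:"
)


def judge_relevance(query: str, document: str,
                    judge: Callable[[str], str]) -> int:
    """Ask an LLM (via the caller-supplied `judge` function) for a graded label."""
    reply = judge(PROMPT.format(query=query, document=document))
    digits = [c for c in reply if c.isdigit()]
    # Fall back to 0 if the model's reply contains no parsable digit.
    return min(int(digits[0]), 3) if digits else 0


if __name__ == "__main__":
    # Stub standing in for a real LLM call, for demonstration only.
    stub = lambda prompt: "2"
    print(judge_relevance(
        "photosynthesis for kids",
        "A primary-school lesson plan on how plants make food from sunlight.",
        stub,
    ))
```

Keeping the model call behind a plain callable keeps the labeling logic testable offline and independent of any particular API client.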

Sources

Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects

LLM for Comparative Narrative Analysis

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs

Beyond Reproducibility: Advancing Zero-shot LLM Reranking Efficiency with Setwise Insertion

RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models

A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment

Benchmarking LLM-based Relevance Judgment Methods

Validating LLM-Generated Relevance Labels for Educational Resource Search
